EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT


Volume XXIII
1963


BOX 6907, COLLEGE STATION, DURHAM, N. C.


Editor: G. Frederic Kuder, Duke University 
Associate Editor: John A. Hornaday, Greensboro College 
Assistant Editor: Joan F. Hornaday 
Business Manager: Geraldine R. Thomas 


BOARD OF COOPERATING EDITORS 


LOUIS D. COHEN M. W. RICHARDSON

University of Florida Richardson, Bellows, Henry and Co. 
HAROLD A. EDGERTON JOHN H. ROHRER 

Performance Research, Incorporated New Mexico
MAX D. ENGELHART Highlands University

Chicago City Junior Colleges DAVID SEGEL

Indiana University 

E. B. GREENE 

Chrysler Corporation P. J. RULON 


J. P. GUILFORD Harvard University


University of Southern California C. L. SHARTLE
E. F. LINDQUIST Ohio State University


State University of Iowa H. C. TAYLOR
The W. E. Upjohn Institute

оар ve бейге for Community Research 
ARDIE LUBIN THELMA G. THURSTONE 

Walter Reed Army Institute University of North Carolina

of Research HERBERT A. TOOPS
SAMUEL MESSICK Ohio State University 

Educational Testing Service E. G. WILLIAMSON 
WILLIAM B. MICHAEL University of Minnesota 

University of California, BEN D. WOOD

Santa Barbara Columbia University


DOROTHY ADKINS WOOD
University of North Carolina


INDEX FOR VOLUME XXIII 


ABE, CLIFFORD (with James M. Richards, Jr. and Victor B. Cline).
Use of a Biographical Information Blank in the Prediction of
Achievement in High School Science

AIKEN, LEWIS R., JR. The Grading Behavior of a College Faculty

ALEXANDER, SHELDON (with T. R. Husek). The Effectiveness of the
Anxiety Differential in Examination Stress Situations

ANDERSON, HARRY E., JR. (with Donald A. Leton). Optimum Grade
Classification with the California Achievement Test [illegible]

BAKER, FRANK B. Generalized Item and Test Analysis Program 
—A Program for the Control Data 1604 Computer ........ 

BASS, BERNARD M. (with George Dunteman, Roland Frye, Robert
Vidulich, and Helen Wambach). Self, Interaction, and Task
Orientation Inventory Scores Associated with Overt Behavior and
Personal Factors

BAUERNFEIND, ROBERT H. (with Warren S. Blumenfeld). A Comparison
of Achievement Scores of Public-School and Catholic-School Pupils

BIANCHINI, JOHN (with Curt Stafford). Scoring Teacher-Made Tests
with the IBM 1620

BLUMENFELD, WARREN S. (with Robert H. Bauernfeind). A Comparison
of Achievement Scores of Public-School and Catholic-School Pupils

BORG, WALTER R. GRE Aptitude Scores as Predictors of GPA for
Graduate Students in Education

BUTLER, JOHN M. (with Donald W. Fiske). The Experimental
Conditions for Measuring Individual Differences

CAMPBELL, DAVID P. (with Wayne W. Sorenson). Response Set on
Interest Inventory Triads

CAMPBELL, DAVID. Another Attempt at Configural Scoring

CASSEL, RUSSELL N. (with Rene De La Briandais). FAST Processing
of Psychological Tests Using High Speed Computing [illegible]

CATHCART, ROBERT (with William B. Michael, Wayne S. Zimmerman,
and Milo Milfs). Gains in Various Measures of Communication
Skills Relative to Three Curricular Patterns in College

CLINE, VICTOR B. (with James M. Richards, Jr. and Clifford Abe).
Use of a Biographical Information Blank in the Prediction of
Achievement in High School Science

COHEN, JACOB (with Elmer L. Struening). Factorial Invariance and
other Psychometric Characteristics of Five Opinions about Mental
Illness Factors
COMBS, ARTHUR W. (with Daniel W. Soper and Clifford C. Courson).
The Measurement of Self Concept and Self Report

COURSON, CLIFFORD C. (with Arthur W. Combs and Daniel W. Soper).
The Measurement of Self Concept and Self Report

CURETON, EDWARD E. Note on Vocabulary Test Construction

DE LA BRIANDAIS, RENE (with Russell N. Cassel). FAST Processing
of Psychological Tests Using High Speed Computing [illegible]

DICKEN, CHARLES F. Convergent and Discriminant Validity of the
California Psychological Inventory

DICKEN, CHARLES. Good Impression, Social Desirability, and
Acquiescence as Suppressor Variables

DIERS, CAROL J. (with Allen L. Edwards). Neutral Items as a
Measure of Acquiescence

DOUGLASS, BRUCE (with William B. Michael, Roger Stewart, and
J. H. Rainwater). An Experimental Determination of the Optimal
Scoring Formula for a Highly-Speeded Test under Different
Instructions Regarding Scoring Penalties

DUBOIS, PHILIP H. (with George Douglas Mayo). Measurement of
Gain in Leadership Training

DUNTEMAN, GEORGE (with Bernard M. Bass, Roland Frye, Robert
Vidulich, and Helen Wambach). Self, Interaction, and Task
Orientation Inventory Scores Associated with Overt Behavior and
Personal Factors

DURFLINGER, GLENN W. Personality Correlates of Success in
[illegible]

DURFLINGER, GLENN W. Academic and Personality Differences
between Women Students Who Do Complete the Elementary Teaching
Credential Program and Those Who Do Not

EDWARDS, ALLEN L. (with James A. Walsh). Relationships between
Various Psychometric Properties of Personality [illegible]

EDWARDS, ALLEN L. (with Carol J. Diers). Neutral Items as a
Measure of Acquiescence

EYDE, LORRAINE D. (with Robert S. Waldrop). Predictors of Scores
on an Employment Counselor Selection Battery

FELDMANN, SHIRLEY (with Max Werner). Validation Studies of a
Reading Prognosis Test for Children of Lower and Middle
Socio-Economic Status

FISKE, DONALD W. (with John M. Butler). The Experimental
Conditions for Measuring Individual Differences


FLAUGHER, RONALD L. (with Jum C. Nunnally and William F. Hodges).
Measurement of Semantic Habits

FLEISHMAN, EDWIN A. Factor Analysis of Physical Fitness Tests

FOSTER, ROBERT J. (with Richard E. Schutz). A Factor Analytic
Study of Acquiescent and Extreme Response Set

FREEBURNE, CECIL M. (with A. Steven Giannell). The Comparative
Validity of the WAIS and the Stanford-Binet with College Freshmen

FRENCH, JOHN W. Comparative Prediction of College Major-Field
Grades by Pure-Factor Aptitude, Interest, and Personality Measures

FRYE, ROLAND (with Bernard M. Bass, George Dunteman, Robert
Vidulich, and Helen Wambach). Self, Interaction, and Task
Orientation Inventory Scores Associated with Overt Behavior and
Personal Factors

GERSHON, ARTHUR (with William B. Michael and Russell Haney).
Intellective and Non-Intellective Predictors of Success in
Nursing Training

GIANNELL, A. STEVEN (with Cecil M. Freeburne). The Comparative
Validity of the WAIS and the Stanford-Binet with College Freshmen

GIUSTI, JOSEPH PAUL. Relationship of High School Curriculum
Experiences to College Grade Point Average

GOLDBERG, LEWIS R. A Model of Item Ambiguity in Personality
Assessment

GUILFORD, J. P. Preparation of Item Scores for the Correlations
Between Persons in a Q Factor Analysis

HANEY, RUSSELL (with William B. Michael and Arthur Gershon).
Intellective and Non-Intellective Predictors of Success in
Nursing Training

HANLON, THOMAS E. (with Mary Helen Michaux, Kay Y. Ota, and
Albert A. Kurland). Rater Perseveration in Measurement of
Patient Change

HOBSON, JAMES R. High School Performance of Underage Pupils
Initially Admitted to Kindergarten on the Basis of Physical and
Psychological Examinations

HODGES, WILLIAM F. (with Jum C. Nunnally and Ronald L. Flaugher).
Measurement of Semantic Habits

HOOK, MARION E. (with Joe H. Ward, Jr.). Application of an
Hierarchical Grouping Procedure to a Problem of Grouping Profiles

HORN, JOHN. Second-Order Factors in Questionnaire Data

HOWARD, KENNETH I. Ratings of Projective Test Protocols as a
Function of Degree of Inference


HUSEK, T. R. (with Sheldon Alexander). The Effectiveness of the
Anxiety Differential in Examination Stress Situations

JACKSON, DOUGLAS N. (with Lee Sechrest). Deviant Response
Tendencies: Their Measurement and Interpretation

JAMES, PATRICIA (with George Moed and Byron W. Wight). Interest
Correlations of the Wechsler Intelligence Scale for Children and
Two Picture Vocabulary Tests

JASPEN, NATHAN. The Wherry-Doolittle Test Selection Method
[illegible]

JONES, ROBERT A. (with William B. Michael). Stability of
Predictive Validities of High School Grades and of Scores on the
Scholastic Aptitude Test of the College Entrance Examination
Board for Liberal Arts Students

KUDER, G. FREDERIC. A Rationale for Evaluating Interests

KURLAND, ALBERT A. (with Mary Helen Michaux, Kay Y. Ota, and
Thomas E. Hanlon). Rater Perseveration in Measurement of
Patient Change

LANA, ROBERT E. (with Ardie Lubin). The Effect of Correlation on
the Repeated Measures Design

LETON, DONALD A. (with Harry E. Anderson, Jr.). Optimum Grade
Classification with the California Achievement Test [illegible]

LEWIS, ANITA ZOE. One-Way Analysis of Covariance Fortran
[illegible] the IBM 7090

LINDQUIST, E. F. An Evaluation of a Technique for Scaling High
School Grades to Improve Prediction of College Success

LINGOES, JAMES C. Multiple Scalogram Analysis: A Set-Theoretic
Model for Analyzing Dichotomous Items

LOCKE, EDWIN A. The Development of Criteria of Student
[illegible]

LORD, FREDERIC M. Cutting Scores and Errors of Measurement:
A Second Case

LORD, FREDERIC M. Formula Scoring and Validity

LUBIN, ARDIE (with Robert E. Lana). The Effect of Correlation on
the Repeated Measures Design

MABERLY, NORMAN C. The Validity of the Graduate Record
Examinations as Used with English-Speaking Foreign Students

MAYO, GEORGE DOUGLAS (with Philip H. DuBois). Measurement of
Gain in Leadership Training

MCQUITTY, LOUIS L. Rank Order Typal Analysis

MCQUITTY, LOUIS L. Best Classifying Every Individual at Every
Level

MELTON, R. S. (with H. G. Osburn). Prediction of Proficiency in a
Modern and Traditional Course in Beginning Algebra


MICHAEL, WILLIAM B. (with Roger Stewart, Bruce Douglass, and
J. H. Rainwater). An Experimental Determination of the Optimal
Scoring Formula for a Highly-Speeded Test under Different
Instructions Regarding Scoring Penalties

MICHAEL, WILLIAM B. (with Robert Cathcart, Wayne S. Zimmerman,
and Milo Milfs). Gains in Various Measures of Communication
Skills Relative to Three Curricular Patterns in College

MICHAEL, WILLIAM B. (with Robert A. Jones). Stability of
Predictive Validities of High School Grades and of Scores on the
Scholastic Aptitude Test of the College Entrance Examination
Board for Liberal Arts Students

MICHAEL, WILLIAM B. (with Mary L. Tenopyr). A Comparison of Two
Computer-Based Procedures of Orthogonal Analytic Rotation with a
Graphical Method When a General Factor is Present

MICHAEL, WILLIAM B. (with Russell Haney and Arthur Gershon).
Intellective and Non-Intellective Predictors of Success in
Nursing Training

MICHAUX, MARY HELEN (with Kay Y. Ota, Thomas E. Hanlon, and
Albert A. Kurland). Rater Perseveration in Measurement of
Patient Change

MILFS, MILO (with William B. Michael, Robert Cathcart, and
Wayne S. Zimmerman). Gains in Various Measures of Communication
Skills Relative to Three Curricular Patterns in College

MOED, GEORGE (with Byron W. Wight and Patricia James). Interest
Correlations of the Wechsler Intelligence Scale for Children and
Two Picture Vocabulary Tests

MORRISON, JACK. The Comparative Effectiveness of Intellective and
Non-Intellective Measures in the Prediction of the Completion of
a Major in Theater Arts

NELSON, RICHARD C. Knowledge and Interests Concerning Sixteen
Occupations Among Elementary and Secondary School Students

NEURINGER, CHARLES. The Form Equivalence Between the
Wechsler-Bellevue Intelligence Scale, Form I and the Wechsler
Adult Intelligence Scale

NUNNALLY, JUM C. (with Donald L. Thistlethwaite and Sharon
Wolfe). Factored Scales for Measuring Characteristics of College
Environments

NUNNALLY, JUM C. (with Ronald L. Flaugher and William F. Hodges).
Measurement of Semantic Habits

OLIVER, THOMAS C. (with Warren K. Willis). A Study of the
Validity of the Programmer Aptitude Test


OSBURN, H. G. (with R. S. Melton). Prediction of Proficiency in a
Modern and Traditional Course in Beginning Algebra

OTA, KAY Y. (with Mary Helen Michaux, Thomas E. Hanlon, and
Albert A. Kurland). Rater Perseveration in Measurement of
Patient Change

PAUK, WALTER. Comparison of the Validities of Selected Test
Procedures to Predict Shorthand Success

PIMSLEUR, PAUL. Predicting Success in High School Foreign
Language Courses

RAINWATER, J. H. (with William B. Michael, Roger Stewart, and
Bruce Douglass). An Experimental Determination of the Optimal
Scoring Formula for a Highly-Speeded Test under Different
Instructions Regarding Scoring Penalties

REID, IAN E. (with Kenneth H. Wodtke, Norman E. Wallen, and
Robert M. W. Travers). Patterns of Needs as Predictors of
Classroom Behavior of Teachers

RICHARDS, JAMES M., JR. (with Victor B. Cline and Clifford Abe).
Use of a Biographical Information Blank in the Prediction of
Achievement in High School Science

SCHUTZ, RICHARD E. (with Robert J. Foster). A Factor Analytic
Study of Acquiescent and Extreme Response Set

SCHWARZ, PAUL A. Adapting Tests to the Cultural Setting

SECHREST, LEE (with Douglas N. Jackson). Deviant Response
Tendencies: Their Measurement and Interpretation

SECHREST, LEE. Incremental Validity: A Recommendation

SOPER, DANIEL W. (with Arthur W. Combs and Clifford C. Courson).
The Measurement of Self Concept and Self Report


TERWILLIGER, JAMES S. Dimensions of Occupational Preference

THISTLETHWAITE, DONALD L. (with Jum C. Nunnally and Sharon
Wolfe). Factored Scales for Measuring Characteristics of College
Environments

TRAVERS, ROBERT M. W. (with Kenneth H. Wodtke, Ian E. Reid, and
Norman E. Wallen). Patterns of Needs as Predictors of Classroom
Behavior of Teachers

VIDULICH, ROBERT (with Bernard M. Bass, George Dunteman, Roland
Frye, and Helen Wambach). Self, Interaction, and Task
Orientation Inventory Scores Associated with Overt Behavior and
Personal Factors

WALDROP, ROBERT S. (with Lorraine D. Eyde). Predictors of Scores
on an Employment Counselor Selection Battery

WALLEN, NORMAN E. (with Kenneth H. Wodtke, Ian E. Reid, and
Robert M. W. Travers). Patterns of Needs as Predictors of
Classroom Behavior of Teachers

WALSH, JAMES A. (with Allen L. Edwards). Relationships between
Various Psychometric Properties of Personality [illegible]

WAMBACH, HELEN (with Bernard M. Bass, George Dunteman, Roland
Frye, and Robert Vidulich). Self, Interaction, and Task
Orientation Inventory Scores Associated with Overt Behavior and
Personal Factors

WARD, JOE H., JR. (with Marion E. Hook). Application of an
Hierarchical Grouping Procedure to a Problem of Grouping Profiles

WERNER, MAX (with Shirley Feldmann). Validation Studies of a
Reading Prognosis Test for Children of Lower and Middle
Socio-Economic Status

WIGHT, BYRON W. (with George Moed and Patricia James). Interest
Correlations of the Wechsler Intelligence Scale for Children and
Two Picture Vocabulary Tests

WILLIS, WARREN K. (with Thomas C. Oliver). A Study of the
Validity of the Programmer Aptitude Test

WODTKE, KENNETH H. (with Ian E. Reid, Norman E. Wallen, and
Robert M. W. Travers). Patterns of Needs as Predictors of
Classroom Behavior of Teachers

WOHLWILL, JOACHIM F. The Measurement of Scalability for
Non-Cumulative Items

WOLFE, SHARON (with Jum C. Nunnally and Donald L.
Thistlethwaite). Factored Scales for Measuring Characteristics
of College Environments

ZIMMERMAN, WAYNE S. (with William B. Michael, Robert Cathcart,
and Milo Milfs). Gains in Various Measures of Communication
Skills Relative to Three Curricular Patterns in College


Statement required by the act of October 23, 1962 showing the ownership, management, and
circulation of the EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, published
quarterly at 2901 Byrdhill Road, Richmond, Virginia 23205.


1. The names and addresses of the publisher, editor, and managing editor are: Publisher,
G. Frederic Kuder, Box 6907 College Station, Durham, N. C. 27708. Editor, G. Frederic
Kuder. Managing Editor, Geraldine R. Thomas, 3121 Cheek Road, Durham, N. C.


2. The owner is: (If owned by a corporation, its name and address must be stated and also
immediately thereunder the names and addresses of stockholders owning or holding 1 per
cent or more of total amount of stock. If not owned by a corporation, the names and
addresses of the individual owners must be given. If owned by a partnership or other
unincorporated firm, its name and address, as well as that of each individual member, must
be given.) G. Frederic Kuder, 2516 Perkins Road, Durham, N. C.


3. The known bondholders, mortgagees, and other security holders owning or holding 1 per
cent or more of total amount of bonds, mortgages, or other securities are: (If there are none,
so state.) None.

I certify that the statements made by me are correct and complete. 


[Seal] 
Geraldine R. Thomas 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT


VOLUME TWENTY-THREE, NUMBER ONE, SPRING, 1963


Editor: G. Frederic Kuder, Duke University
Associate Editor: John A. Hornaday, Greensboro College
Assistant Editor: Joan F. Hornaday
Business Manager: Geraldine R. Thomas


BOARD OF COOPERATING EDITORS


LOUIS D. COHEN
  University of Florida
HAROLD A. EDGERTON
  Performance Research, Incorporated
MAX D. ENGELHART
  Chicago City Junior Colleges
E. B. GREENE
  Chrysler Corporation
J. P. GUILFORD
  University of Southern California
E. F. LINDQUIST
  State University of Iowa
FREDERIC M. LORD
  Educational Testing Service
ARDIE LUBIN
  Walter Reed Army Institute of Research
SAMUEL MESSICK
  Educational Testing Service
WILLIAM B. MICHAEL
  University of California, Santa Barbara
M. W. RICHARDSON
  Richardson, Bellows, Henry and Co.
JOHN H. ROHRER
  Georgetown University School of Medicine
P. J. RULON
  Harvard University
DAVID SEGEL
  Indiana University
C. L. SHARTLE
  Ohio State University
H. C. TAYLOR
  The W. E. Upjohn Institute for Community Research
THELMA G. THURSTONE
  University of North Carolina
HERBERT A. TOOPS
  Ohio State University
E. G. WILLIAMSON
  University of Minnesota
BEN D. WOOD
  Columbia University
DOROTHY ADKINS WOOD
  University of North Carolina


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
VOL. XXIII, No. 1, 1963


A RATIONALE FOR EVALUATING INTERESTS 


G. FREDERIC KUDER 
Duke University 


A good many years have passed since Hull (1928) predicted 
that some day there would be a system for automatically organizing 
all pertinent data about a person into the best possible basis for 
choosing a life's work. It can hardly be supposed that Hull had 
electronic computers in mind, but it is true that the advent of these 
magic machines has made his idea appear more realistic for at least 
some portions of the job. The effect of the machines is more pro- 
found than merely speeding up old procedures. They have opened 
the way to more adequate formulations of old problems. 

The approach described in these pages is directed at one impor- 
tant aspect of the problem of helping young people who are in 
search of suitable occupations. The availability of computers makes 
it feasible. Although it is generally applicable to any data which can 
be collected concerning people, it appears to be particularly appro- 
priate at this time to the field of interests. 

Over the years, many approaches have been made to the problem 
of evaluating interests. The writer has suggested elsewhere (1954) 
that a logical start is the design and construction of an interest 
inventory with items fairly evenly distributed throughout the do- 
main they represent, and that the selection of the items for the 
final instrument must rest on more than an impressionistic basis. 
The development of such an instrument necessarily calls for trying
out and analyzing a succession of collections of items and for deter-
mining in the process the relation of every experimental item to
a number of reference vectors.


Strong's System 


Of course the next step involves making the most of the responses
to the interest inventory which has been developed. The method
devised by Strong almost forty years ago has become such an in-
tegral part of measurement theory and practice that there may be
a tendency to forget that the use of a general reference group rep-
resented at the time an ingenious solution to a knotty problem.
There may also be a tendency to regard the method as the ulti-
mate solution.

The use of a general reference group, while apparently solving
one problem, raised another troublesome one, namely, what the
composition of such a group should be. It might be supposed that
a sample of the general population would be most logical, but
Strong (1943) soon discovered that scales based on a reference
group of this sort did not give good differentiation among profes-
sional groups. His studies indicate that there is no single reference
group which is satisfactory for the whole range of occupations. He
has demonstrated that a scale developed for a certain occupation
with respect to one reference group may have little correlation with
another scale developed for that same occupation with respect to
another reference group. When discussing keys for women’s occupa-
tions, Strong (1959, p. 23) says, “Maximum differentiation among
occupations is obtained when the women-in-general group is located
in the center of the occupations.”


An Alternate Formulation 


Strong's principle is no doubt valid when a general reference
group is used. In the absolute sense, however, maximum differentia-
tion among occupations is obtained when the most effective device
is provided for differentiating between the occupations in each of
the many possible pairs. This may appear to be a preposterously
elaborate and impractical goal. Nevertheless it should be kept in
mind.

Interest inventories are used mainly to help young people choose
occupations they will find satisfying, and there is evidence that
such inventories can be useful in this respect.¹ The use of a men-in-
general or women-in-general base for developing occupational scales
implies that the choice is between a single occupation and a com-
posite of all other occupations. But a lot in sharpness of differentia-
tion is likely to be lost by this process. All types of engineers tend
to get high scores on scales developed for any engineering specialty.
All types of physicians tend to get high scores on scales developed
for any particular kind of physician. There is little or no differentia-
tion within broad groupings of occupations.

These results do not necessarily mean that a certain inventory 
cannot be used to differentiate occupations within these broad 
groups. They merely reflect the fact that responses which have large 
differences between the reference group and the broad group are not
identical with those which are best for differentiating among the
specific occupations in the broad group. Some attempts have been
made to meet this situation by using averages of these broad groups 
for reference, or at least the averages for the broad group with the 
exception of the one occupation for which a key is being developed. 
If the reasoning involved is carried to its logical conclusion, the 
problem becomes one of differentiating among a number of specific 
occupations. And this is as it should be. The actual choice confront- 
ing the individual is almost always which of a number of specific 
occupations he should settle upon. 

The existence of electronic computers makes it possible to ap- 
proach this problem with some hope of success. The solution sug- 
gested in these pages is remarkably simple, although the process of 
arriving at it has not seemed so! The chief stumbling block has 
been the necessity for achieving a reorientation with respect to tra- 
ditional approaches. In a sense, the proposed solution is in the face 
of a trend toward simplification. At one time or another, Strong 
reduced the weights used in scoring the Strong Vocational Interest 
Blank, and he finally restricted them to a range of nine. Other 
investigators dealing with various instruments have demonstrated
that the trend can be carried to unit scoring with only slight effect
on the degree of differentiation.


1 McRae's extensive follow-up study (1959) of 1164 young people over a
period of about nine years is probably the most definitive investigation which
has been made on this subject. McRae found that the likelihood of being
dissatisfied with one's work is about a third as great for those who go into a
line of work consistent with their Preference Record scores as for those who
do not. These results are almost identical with those obtained by Lipsett and
Wilson (1954) in an earlier follow-up study.

In spite of the trend toward simple weights, let us consider using
fairly elaborate scoring. Let us use as response weights for differ-
entiating between two occupations the actual differences between
the proportions of the two groups (A and B) marking each particu-
lar response. The formula for the weight to be assigned a response
then becomes


$$ W_{A-B} = p_A - p_B. \tag{1} $$


If these weights are determined for all responses in the inventory, 
then the difference score X for any subject j becomes the sum of the 
weights assigned to all of the subject’s responses: 

$$ X_j = \sum p_A - \sum p_B, \tag{2} $$
the summation being over the responses of subject j. 
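
Equations 1 and 2 can be illustrated with a small computational sketch. The response labels and proportions below are invented for illustration and are not data from the Preference Record; the function names are likewise hypothetical.

```python
from typing import Dict, Iterable

# Hypothetical proportions of Group A and Group B marking each response.
P_A: Dict[str, float] = {"r1": 0.70, "r2": 0.20, "r3": 0.55}
P_B: Dict[str, float] = {"r1": 0.40, "r2": 0.35, "r3": 0.50}


def response_weights(p_a: Dict[str, float], p_b: Dict[str, float]) -> Dict[str, float]:
    """Equation 1: the weight of a response is p_A minus p_B."""
    return {r: p_a[r] - p_b[r] for r in p_a}


def difference_score(marked: Iterable[str],
                     p_a: Dict[str, float],
                     p_b: Dict[str, float]) -> float:
    """Equation 2: sum of the weights over the responses the subject marked."""
    weights = response_weights(p_a, p_b)
    return sum(weights[r] for r in marked)


# Responses marked by a hypothetical subject j.
marked_by_j = ["r1", "r3"]
x_j = difference_score(marked_by_j, P_A, P_B)
print(round(x_j, 2))
```

A positive score places the subject nearer Group A and a negative score nearer Group B; every response contributes, however small its weight.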


Possible Limitations


This suggestion is subject to several reasonable objections. For 
one thing, it includes all responses in the key regardless of the sig- 
nificance of the differences involved. For another, it treats a differ- 
ence in the middle range of percentages as just as significant as the 
same difference between percentages near the extremes. Further, 
it calls for complex weights in the face of general evidence that 
elaborate weighting adds little if anything to the differentiating 
power of keys. It might also be held that, as long as the problem 
is being overhauled, intercorrelations of items should be included 
in the formulation. 

The general answer to these comments must be that the use of
computers does not eliminate practical considerations. It may be
necessary to sacrifice certain apparently desirable features in order 
to do the job at all. Let us consider briefly each point in turn. 

The disadvantage of including in the key all responses, even those
with differences which do not meet a predetermined standard of sig-
nificance, may not be particularly serious. A majority of the dif-
ferences which do not quite meet the standard will still reflect
“real” differences; the responses with very small differences will
not affect the relative size of the scores much anyway. It has been
the writer’s experience that the number of items included in a key
for Form D of the Preference Record can vary over a wide range
without affecting the differentiation much (Kuder, 1957a).

Of course there must not be an overwhelming number of items 
with insignificant differences. The distributions of the differences 
must be considerably greater than would be obtained by chance. As 
Cureton (1950) has so devastatingly pointed out, there has to be 
some validity in the items in order to develop a valid scale. In the 
case of Form D of the writer’s Preference Record, it is not uncom- 
mon for half of the items to have differences significant at the one 
per cent level of confidence when fairly large groups are involved. 

The effect of giving the same weights to the same differences, 
regardless of where they occur on the range of proportions, may 
also be of little consequence. By this time it is pretty well estab- 
lished that fairly large shifts in the weights assigned responses in 
long scales generally have little effect on validity and reliability. 

The criticism that the proposed scoring system is complicated is 
of little importance today. Electronic computers can handle large 
weights about as easily as small ones. A few years ago the procedure
suggested would have been nothing less than absurd; today it is
feasible. As a matter of fact, the system can be modified easily, if
desired, so as to involve a range of weights and scores comparable
to those of Strong.

What the effect of taking intercorrelations into account would 
be is difficult to estimate. Theoretically, the use of items fairly 
evenly spaced throughout the domain represented should keep any 
advantage from this source to a minimum. Using intercorrelations 
of items does not appear practical today, though it may well be 
tomorrow. 

Let us now consider the terms on the right in equation 2. The 
first term is the sum of the proportions of the members of Group A 
who marked the responses selected by subject j. This sum turns 
out to be the average score which would be obtained if all the peo- 
ple in Group A were scored using subject j's answers as the scoring 
stencil. Similarly, $\sum p_B$ is the mean of Group B when scored by sten-
cil j. These terms are referred to as “proportion scores” in the fol-
lowing discussion.
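
The equivalence between a proportion score and a group's mean score under the subject's stencil can be checked numerically. The sketch below is a simplification that assumes one marked response per item; the answer patterns are invented and the function names are hypothetical.

```python
from typing import List

# Hypothetical Group A: one row per member, one chosen response per item.
group_a: List[List[str]] = [
    ["a", "b", "a"],
    ["a", "a", "b"],
    ["b", "b", "a"],
    ["a", "b", "b"],
]
# Hypothetical subject j, whose answers serve as the scoring stencil.
subject_j: List[str] = ["a", "b", "b"]


def proportion_score(group: List[List[str]], stencil: List[str]) -> float:
    """Sum, over items, of the proportion of the group marking the stencil's response."""
    n = len(group)
    return sum(
        sum(1 for member in group if member[item] == choice) / n
        for item, choice in enumerate(stencil)
    )


def mean_stencil_score(group: List[List[str]], stencil: List[str]) -> float:
    """Mean number of agreements when each group member is scored with the stencil."""
    scores = [
        sum(1 for got, key in zip(member, stencil) if got == key)
        for member in group
    ]
    return sum(scores) / len(scores)


ps = proportion_score(group_a, subject_j)
ms = mean_stencil_score(group_a, subject_j)
print(ps, ms)
```

The two functions count the same agreements, one summing by item and the other by person, so they necessarily agree.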


Results of Some Applications 


If it is desired to get an actual score for Occupation A as com-
pared with Occupation B, or for any other pair of occupations,
the differences can of course be computed and norms established.
If Group B were men-in-general, then the difference scores ($X_j$) for
a group of subjects should be closely related to scores on keys de-
veloped by methods now in general use.

A rough check has been made on this conclusion using a sample
of 100 men from the base group for Form D of the Preference Rec- 
ord. At the time of going to press, keys for 61 occupations had been 
developed by a system which gives unit credit for each position on 
the key. The correlations between the scores from these 61 keys, 
and the corresponding difference scores described here range from 
68 to .99. The median is .93. Only one correlation is below .80. 
These generally high correlations have been obtained in spite of the 
fact that the scores used represent two extremes in weighting—unit 
eredit on the one hand, and differences in proportions carried to 
three places on the other. 

It is apparent, however, that the value of this approach is not in
comparing occupational groups with a base group but in comparing
occupational groups directly with each other. The differences
between proportion scores for two occupational groups represent a
scale specifically designed to differentiate between those two groups.
If the rationale is correct, this differentiation should be better than
the differentiation obtained from scales developed through use of a
general reference group.

In order to get an idea of the results obtainable from the system
described, it has been applied to some occupations for which fairly
large cross-validation groups were available, the data being from
Form D of the Preference Record. The response proportions used in
the scoring were from criterion groups ranging in number from 200
to 400. Membership in the criterion and cross-validation groups was
confined to people who liked their work.²

Inspection of Table 1 reveals that the percentage of overlapping 
ranges from two to twenty per cent. The median of the entire
table is eight per cent. The degree of differentiation obtained is of 
an order seldom if ever reached before in the field of interests. 

² Membership in these groups was further limited to men 25 to 65 years of
age who had been employed for at least three years in the occupation under
consideration.


TABLE 1

Overlapping of Difference Scores and Point-biserial
Correlations for Pairs of Occupational Groups
Not Used in Establishing the Scoring System*

The entries above the diagonal are per cent of overlapping estimated by the
method described by Tilton (1937).
The entries below the diagonal are point-biserial correlations, using a formula
which gives equal weight to each part of the dichotomy (Kuder, 1957b).

Occupation                       1     2     3     4     5     6     N
1. Journalist                         6%    6%    8%    3%    2%    92
2. Architect                   .88         20%    6%    9%    1%    90
3. Forester                    .88   .79         14%   16%   15%   100
4. Professor of Psychology     .87   .88   .83          9%    2%   100
5. Pediatrician                .90   .86   .82   .86          6%    90
6. Automobile Mechanic         .91   .85   .82   .91   .88         100

* The data on which this table is based are presented in Table 2.


TABLE 2

Means and Standard Deviations of Difference Scores

The entries in each cell are descriptive of difference scores obtained by
subtracting proportion scores for the row occupation from proportion scores
for the column occupation. The first entry in a cell represents the mean of
difference scores for the column occupational group; the second entry is the
standard deviation of the same scores.

Occupation                   1        2        3        4        5        6
1. Journalist       M                64.10    75.12    67.56    78.98   122.45
                    σ                34.29    39.56    34.14    28.61    56.96
2. Architect        M    78.96                68.20    70.94    68.59   112.75
                    σ    42.14                41.12    32.69    34.51    57.19
3. Forester         M    72.98    44.01                79.13    61.52    41.91
                    σ    38.63    46.43                42.10    37.86    36.32
4. Professor of     M    61.84    52.64    59.21                36.82   129.49
   Psychology       σ    40.25    33.10    50.45                19.19    64.96
5. Pediatrician     M    71.85    43.47    55.49    33.30               102.95
                    σ    41.53    32.09    45.15    21.75                56.97
6. Automobile       M   126.38    84.61    58.08   146.76   106.83
   Mechanic         σ    52.72    67.79    33.64    56.65    55.98


The corresponding point-biserial correlations, also reported in
Table 1, range from .79 to .91 and have a median of .87. These are
the correlations between difference scores and the dichotomous
variable formed by the two occupations involved. As applied to this
situation, the correlation increases with the degree of differentiation 
achieved between the occupations. The results in Table 1 are based 
on the data presented in Table 2. It may be noted that differences 
below the diagonal in Table 2 have been computed in the opposite 
direction from those above the diagonal. Take, for example, the 
comparison of journalist and forester scores. Journalists obtain 
higher journalist than forester scores on the average, the mean dif- 
ference being 72.98. Foresters, on the other hand, obtain higher for- 
ester than journalist scores, the mean difference being 75.12, but if 
the differences were computed in the same direction as for journalists
(journalist score minus forester score), this second mean difference
would be negative. The actual difference between the means is
148.10.

It may be noted that the measure of overlapping is approximately 
twice the percentage of incorrect classifications. For example, the 
six per cent overlapping between journalists and architects reflects 
the fact that, on the basis of difference scores, about three per cent 
of journalists are incorrectly classified as architects, and about three 
per cent of architects are classified as journalists when the cutting 
point is halfway between the means. 
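The relation between overlapping and misclassification can be illustrated with a short sketch. It assumes, as a simplification, that the two groups are normally distributed with a common standard deviation (taken here as the average of the two reported values) and that the cutting point is halfway between the means; whether this is exactly the computation behind Table 1 is not stated here. The function name is hypothetical. The journalist and forester figures are those of Table 2, with the forester mean given a negative sign so that both differences run in the same direction.

```python
import math


def overlap_and_misclassification(mean_1, sd_1, mean_2, sd_2):
    """Return (overlap, misclassified) as proportions, assuming normal
    distributions, a pooled SD, and a cut halfway between the means."""
    pooled_sd = (sd_1 + sd_2) / 2.0          # assumption: simple average of SDs
    d = abs(mean_1 - mean_2) / pooled_sd     # distance between means in SD units
    # Proportion of either group falling on the wrong side of the midpoint:
    # the normal tail area beyond d / 2.
    misclassified = 0.5 * math.erfc(d / (2.0 * math.sqrt(2.0)))
    # Under these assumptions the overlap is twice that proportion.
    return 2.0 * misclassified, misclassified


# Journalists: mean 72.98, SD 38.63; foresters: mean -75.12, SD 39.56.
overlap, misclassified = overlap_and_misclassification(72.98, 38.63, -75.12, 39.56)
print(f"overlap = {overlap:.1%}, misclassified = {misclassified:.1%}")
```

Under these assumptions the overlap for journalists and foresters comes to a little under six per cent, close to the entry for that pair in Table 1.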


* These point-biserial correlations should be multiplied by 1.25 before
comparing them with the biserial correlations reported [illegible]. It should
be noted that the biserial correlation is not appropriate to this situation, as
contrasted with point-biserial correlation. Apparently it is sometimes forgotten
that the use of biserial correlation involves assumptions concerning the
distributions of both variables. It is not only assumed that the dichotomous
variable represents an arbitrary cut in a normal distribution; it is also
assumed that the continuous variable is normally distributed. It is doubtful
whether either assumption is justified when the data from two diverse groups
are combined. In the case of the continuous variable, visual inspection of the
distributions of the two groups involved will ordinarily reveal that a combined
distribution is not likely to be normal. The greater the difference in means,
the more distinctly the combined distribution becomes bimodal. In the case of
bimodal distributions of the continuous variable and equal numbers in each part
of the dichotomous variable, the upper limit of biserial correlation is 1.25
rather than 1.00.


The Significance of Overlapping


To what extent is it possible or even desirable to obtain perfect
differentiation between any two occupational groups on the basis of
interests in common activities? That almost perfect differentiation
is possible in some cases is demonstrated by the figures reported in 
Table 1. But it may well be questioned whether the criterion is 
perfect in all cases. There is the possibility that there are some in- 
dustrial psychologists, for example, who would be just as happy, 
if not happier, as professors of psychology, and that some professors 
would have found work in industry more to their liking. Perhaps 
there are some machinists who would have found greater satisfaction
as automobile mechanics, and some automobile mechanics who
would have been happier as machinists.

There may be no definitive way of checking on this idea. It does 
seem reasonable, however, that it would be a matter of indifference 
to some people in two similar occupations whether they engaged in 
one occupation or the other, and that the proportion involved would 
depend on the similarity of the occupations. For such people, differ- 
entiation according to the occupations they happened to enter could 
not be obtained; presumably they would receive relatively small 
difference scores between the two occupations on any instrument 
which might be devised for the purpose. On the other hand, there 
would probably be a number of people to whom it would matter a 
great deal whether they were in one occupation or the other. If 
these people could be identified, it might be possible to achieve a
high degree of differentiation for them. When all members of each 
occupation are lumped together, the overlapping on the difference 
scores for discriminating between two occupations may be substan- 
tial. Even when there is considerable overlapping, however, almost 
perfect prediction is possible for people with large difference scores 
(the size of the necessary difference depending upon the degree of 
overlapping), and these people are likely to be the ones for whom 
the choice is most important. 

Further evaluation and refinement of this approach must await 
further studies. The scores obtained have some interesting properties 
which appear to deserve intensive exploration. It should be noted 
again that the technique is a general one. It appears to be particu- 
larly appropriate to tests and inventories which have been designed 
and constructed in such a way that, as in the case of Form D, the 
items are fairly evenly scattered throughout the domain they 
represent. 


REFERENCES 


Cureton, Edward E. “Validity, Reliability and Baloney.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, X (1950), 94-96.

Hull, Clark. Aptitude Testing. Yonkers, New York: World Book
Company, 1928.

Kuder, G. Frederic. “Expected Developments in Interest and Personality
Inventories.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT,
XIV (1954), 265-271.

Kuder, G. Frederic. “A Comparative Study of Some Methods of
Developing Occupational Keys.” EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, XVII (1957), 105-114. (a)

Kuder, G. Frederic. Research Handbook for the Kuder Preference
Record, Form D. Chicago: Science Research Associates, 1957. (b)

Lipsett, Laurence and Wilson, James W. “Do Suitable Interests
and Mental Ability Lead to Job Satisfaction?” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, XIV (1954), 373-380.

McRae, G. G. “The Relationships of Job Satisfaction and Earlier
Measured Interests.” Unpublished Ph.D. thesis, University of
Florida, Gainesville, Florida, 1959.

Strong, Edward K., Jr. Vocational Interests of Men and Women.
Stanford University, California: Stanford University Press, 1943.

Strong, Edward K., Jr. Manual for the Strong Vocational Interest
Blanks. Stanford University, California: Stanford University
Press, 1959.

Tilton, J. W. “The Measurement of Overlapping.” Journal of
Educational Psychology, XXVIII (1937), 656-662.




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


PREPARATION OF ITEM SCORES FOR THE
CORRELATIONS BETWEEN PERSONS IN A Q FACTOR
ANALYSIS¹


J. P. GUILFORD 
University of Southern California 


This paper is mainly concerned with the question of an appropriate
index of correlation between persons to be used in a Q factor
analysis. When correlating dichotomous scores of 1 and 0, the chief
difficulty is that considerable restraint is placed upon the indices of
correlation due to variations of person means. A person’s mean,
when scores are 1 and 0, is simply p_1, the proportion of unit scores
he has. With increasing differences in the means of two persons,
there is increasing restriction in possible range of correlation
coefficients from a 2 x 2 contingency table.

Another disturbing effect occurs when, although the two person 
means approach equality, they are of extreme value. The more
closely the two means approach either 0 or 1, the greater the (posi- 
tive) correlation is likely to be. From their correlation coefficient, 
two people may appear to be very much alike just because they both 
answer positively many items that most people answer positively, 
or, in common, they fail to answer positively many items just as the 
great majority of individuals do. 

As a consequence of the latter principle, investigators who have
employed coefficients such as the tetrachoric r or the phi coefficient
have been plagued by the fact that two or more individuals who
similarly give unusually large numbers of positive responses to
two-choice items of the test instrument tend to share a common factor
attributable in very large part to this circumstance alone. Likewise,
two or more individuals who give large numbers of negative
responses may share a common factor. We might regard such factors
as being in the same category with “difficulty factors” sometimes
obtained in analysis of tests of ability by an R-model factor analysis.
Such a Q factor might even be given a meaningful interpretation as
representing a common acquiescence bias or response set. Or it
might be a combination of this bias with a social-desirability bias, if
positive responses are systematically socially favorable (or
unfavorable).


¹ Based on a paper read at the annual meeting of the Society of Multivariate
Experimental Psychologists, at Fort Worth, Texas, November 16, 1961.

If such easily recognized artifactual factors were the only conse- 
quence of the biased correlation coefficients, little damage would be 
done. But it is likely that such factors are not clearly separated from 
those having psychological significance and that the part of the fac- 
tor structure pertaining to the latter is adversely affected in terms 
of interpretational clarity. 

Of course, an experimental procedure that largely avoids this 
difficulty is a Q-sort experimental approach. But there are often
reasons for preferring dichotomous responses to items over Q-sort
responses, such as palatability of the task for the experimental
subjects and the nature of the measuring instrument the investigator
wants to use. The instrument may be composed of items to which
one of two responses is required as a necessary feature, or the
investigator has some reason connected with his research hypothesis for
keeping to this kind of item.

In addition to the difficulties presented by differing person means 
and their effects upon correlation coefficients, there is the question 
of what kind of scores should be utilized for correlation purposes in 
a Q analysis. In a Q analysis, we should employ ipsative scores, for 
we are concerned with intra-individual variations among test vari- 
ables and not with inter-individual variations among persons, as in 
R-technique analysis. Scores of 1 and 0 for an item are on a norma- 
tive scale and consequently are not suitable for correlations between 
persons. The solution suggested in this paper is to ipsatize item 
scores, or at least to go far enough in that direction. It turns out 
that this procedure also offers a promising solution to the eccentricity 
problem, the problem arising from variation in person means.




The Indices of Correlation Considered 


In order to emphasize the importance of the problem of eccen- 
tricity, let us consider a number of indices of correlation that are 
used or could be used in the correlations of persons, using the norma- 
tive scores of 1 and 0. Many of these indices will probably continue 
to have their appeal because of the ease with which they can be
applied to data in a 2 x 2 contingency table. But the problem of
eccentricity imposes restrictions in connection with R analyses as
well as with Q analyses, hence it is worth our while to give these 
coefficients some general attention. 

The indices of correlation considered here include the phi
coefficient, or product-moment correlation in a 2 x 2 contingency
table; the tetrachoric coefficient, estimated from the Thurstone
diagrams (Cheshire, et al., 1938); the correlation from unlike signs;
and an index I, proposed by B. J. Winer (Palmer & McCormick,
1961) for dealing with checklist data. The latter statistic is given
by the formula

\[ I = \sqrt{\frac{\left| a - \tfrac{1}{2}(b + c) \right|}{a + b + c}} \]

which takes the algebraic sign of the numerator term. The symbols
a, b, and c pertain to the familiar cell frequencies in a 2 x 2
contingency table, a being the number of plus-plus agreements, b the
number of minus-plus disagreements, and c the number of
plus-minus disagreements.
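The three indices that are computed directly from cell frequencies may be sketched as follows. The cosine form used for the correlation from unlike signs and the expression used for I are assumptions about the formulas intended; they do reproduce the values reported later for persons A and B (.58, .89, and .86) when the cell frequencies 14, 2, 1, and 3 are read from the obtained scores. The function names are hypothetical, and d denotes the number of minus-minus agreements.

```python
import math


def phi(a, b, c, d):
    """Product-moment correlation in a 2 x 2 table.
    Undefined (division by zero) when any marginal total is zero."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


def r_unlike_signs(a, b, c, d):
    """Correlation from unlike signs, taken here as the cosine of pi times
    the proportion of unlike-signed (disagreeing) cases."""
    n = a + b + c + d
    return math.cos(math.pi * (b + c) / n)


def winer_index(a, b, c):
    """Index I: square root of |a - (b + c)/2| / (a + b + c), carrying the
    algebraic sign of the numerator. Requires a + b + c > 0."""
    numerator = a - (b + c) / 2.0
    magnitude = math.sqrt(abs(numerator) / (a + b + c))
    return math.copysign(magnitude, numerator)


# Persons A and B: 14 plus-plus, 2 minus-plus, 1 plus-minus, 3 minus-minus.
print(round(phi(14, 2, 1, 3), 2))             # 0.58
print(round(r_unlike_signs(14, 2, 1, 3), 2))  # 0.89
print(round(winer_index(14, 2, 1), 2))        # 0.86
```

With no plus-plus agreements at all the index reaches its lower limit of −.707, whatever the numbers of disagreements.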


Deriving Item Scores with Ipsative Properties 


If we were dealing with total test scores, from tests differing in 
means and variances, we should not think of correlating persons 
without modifying the score matrix to make such correlations
permissible for a Q analysis. We would at least first transform the test
scores to a common scale, achieving equal means and standard
deviations in the population. A scale of standard scores would be
appropriate. Although not completely ipsative, such scores are modified
sufficiently in that direction for correlational purposes in a Q analysis. 

Let us consider such a transformation for item scores of 1 and 0,
which, as pointed out above, are normative scores. The mean for
an item is the proportion of unity scores, p_1, and the variance is the
product p_1 q_1, where q_1 = 1 − p_1. The standard scores corresponding
to raw scores of 1 and 0 are readily obtained from knowledge of p_1,
according to the following proofs:

The general expression for a standard score is

\[ z = \frac{X - M_x}{\sigma_x} \]

When an obtained item score X equals 1,

\[ z_1 = \frac{1 - p_1}{\sqrt{p_1 q_1}} = \frac{q_1}{\sqrt{p_1 q_1}} = \sqrt{q_1 / p_1} \]

When an obtained item score equals 0,

\[ z_0 = \frac{0 - p_1}{\sqrt{p_1 q_1}} = \frac{-p_1}{\sqrt{p_1 q_1}} = -\sqrt{p_1 / q_1} \]


In transforming a set of obtained item scores into z scores, to be
rid of negative signs, one might add the constant 5.0 to all values,
obtaining probits. But there is at least one advantage in leaving the
means of the item z scores at zero, in hand computations at least. It
will be noted that the two z scores for any item are reciprocals with
change of algebraic sign. The cross-product for such a pair of values
is simply −1.0. The finding of the two standard-score values for any
item is made very easy by tables that give those values corresponding
to p or q (Guilford, 1954, 1956).
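Where tables are not at hand, the two standard scores for an item follow directly from its mean. The sketch below is a minimal illustration; the function name is hypothetical, and the item mean of .25 is chosen only as an example.

```python
import math


def item_standard_scores(p):
    """Return (z_1, z_0): the standard scores for raw scores of 1 and 0
    on an item whose mean (proportion of unit scores) is p."""
    if not 0.0 < p < 1.0:
        raise ValueError("the item mean must lie strictly between 0 and 1")
    q = 1.0 - p
    return math.sqrt(q / p), -math.sqrt(p / q)


z1, z0 = item_standard_scores(0.25)
print(round(z1, 2), round(z0, 2))  # 1.73 -0.58
print(round(z1 * z0, 6))           # cross-product of the pair: -1.0
```

The item mean of the z scores remains at zero, since p z_1 + q z_0 = 0 for any p.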


Correlations of Persons with Relatively Extreme Means 


To illustrate some of the points mentioned above, let us use a small
item-score matrix with 20 persons and 20 items. Initially, the data
were observed item scores derived [...] Table 1. In four instances,
pairs [...] arbitrarily reversed in order [...] leading to zero
frequencies in contingency tables. The standard scores corresponding
to the obtained scores appear at the right in Table 1. The
intercorrelations of persons using these standard scores will be
symbolized by r_s.


TABLE 1

Obtained and Standard Item Scores for Four Persons
on Twenty Items, with Item Means

       Obtained Item Scores    Item      Standard Item Scores (z)
         (X) for Persons       Means           for Persons
Items    A    B    C    D      (p_1)        A        B        C        D
    1    1    1    1    0        .95      .23      .23      .23    −4.36
    2    1    1    0    0        .30     1.53     1.53     −.66     −.66
    3    1    1    1    1        .95      .23      .23      .23      .23
    4    0    1    1    1        .90    −3.00      .33      .33      .33
    5    1    1    0    0        .25     1.73     1.73     −.58     −.58
    6    1    1    0    0        .30     1.53     1.53     −.66     −.66
    7    1    1    0    0        .25     1.73     1.73     −.58     −.58
    8    0    0    0    0        .20     −.50     −.50     −.50     −.50
    9    1    1    0    0        .70      .66      .66    −1.53    −1.53
   10    1    1    0    1        .65      .73      .73    −1.36      .73
   11    1    1    1    0        .75      .58      .58      .58    −1.73
   12    0    1    0    0        .50    −1.00     1.00    −1.00    −1.00
   13    1    1    0    0        .15     2.38     2.38     −.42     −.42
   14    0    0    1    1        .65    −1.36    −1.36      .73      .73
   15    1    1    0    0        .40     1.22     1.22     −.82     −.82
   16    1    1    0    1        .45     1.11     1.11     −.90     1.11
   17    1    0    0    1        .65      .73    −1.36    −1.36      .73
   18    1    1    0    0        .60      .82      .82    −1.22    −1.22
   19    0    0    0    0        .10     −.33     −.33     −.33     −.33
   20    1    1    0    0        .10     3.00     3.00     −.33     −.33

  p_1  .75  .80  .25  .30

There should be considerably more freedom from dichotomous 
scoring in computing the coefficient r_s. Instead of two scores, 1 and 0,
for each person the standardizing process yields wide distributions 
of values (see columns at the right in Table 1). The freedom is not 
complete, for the information after all was obtained in two cate- 
gories. Although only two values remain for each item, the values 
of those two differ from item to item, depending upon the item mean. 
A double-centered score matrix (with equal means and variances for 
persons as well as for items) is not achieved by using z scores, but 
this is not necessary. 

Because of strong similarities in extreme means, we should expect
the correlations r_AB and r_CD to be substantial and positive. Because
of marked dissimilarities in means, the other four correlations should
be very low or negative. These expectations are borne out for
correlations of the type φ, r_t, r_u and I, without exception, as shown in
Table 2.² The coefficient r_s is obtained by a Pearson product-moment
correlation of persons’ z scores. This coefficient shows some tendency
in the same systematic directions, as seen in Table 2, but there are
two notable exceptions. One is that, in spite of the large numbers of
negative z scores in common to persons C and D, the r_s for those two
persons is slightly negative, rather than positive. The other is that
the correlations between unlike pairs of persons, although negative,
are much closer to zero and they vary more in size, as compared with
the corresponding coefficients r_t and r_u.
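The computation of φ and r_s for a pair of persons may be sketched as follows. The 1 and 0 scores and the item means are those of Table 1 as they can be read from the standard scores printed there; the helper names are hypothetical. Because exact rather than two-place z scores are used, the r_s values obtained this way need not agree exactly with the printed ones.

```python
import math

# (A, B, C, D, item mean) for the twenty items, as read from Table 1.
ITEMS = [
    (1, 1, 1, 0, .95), (1, 1, 0, 0, .30), (1, 1, 1, 1, .95), (0, 1, 1, 1, .90),
    (1, 1, 0, 0, .25), (1, 1, 0, 0, .30), (1, 1, 0, 0, .25), (0, 0, 0, 0, .20),
    (1, 1, 0, 0, .70), (1, 1, 0, 1, .65), (1, 1, 1, 0, .75), (0, 1, 0, 0, .50),
    (1, 1, 0, 0, .15), (0, 0, 1, 1, .65), (1, 1, 0, 0, .40), (1, 1, 0, 1, .45),
    (1, 0, 0, 1, .65), (1, 1, 0, 0, .60), (0, 0, 0, 0, .10), (1, 1, 0, 0, .10),
]


def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


def standard_score(x, p):
    """z score for a raw item score of 1 or 0 on an item with mean p."""
    q = 1.0 - p
    return math.sqrt(q / p) if x == 1 else -math.sqrt(p / q)


raw = {name: [row[i] for row in ITEMS] for i, name in enumerate("ABCD")}
std = {name: [standard_score(row[i], row[4]) for row in ITEMS]
       for i, name in enumerate("ABCD")}

phi_ab, phi_cd = pearson(raw["A"], raw["B"]), pearson(raw["C"], raw["D"])
rs_ab, rs_cd = pearson(std["A"], std["B"]), pearson(std["C"], std["D"])
print(round(phi_ab, 2), round(phi_cd, 2))  # phi for A-B and C-D
print(round(rs_ab, 2), round(rs_cd, 2))    # r_s for the same pairs
```

For persons C and D, φ is positive while r_s is negative, which is the first of the two exceptions noted above.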

If the 20 items were regarded as a sample, we should not, of course,
be very confident about comparisons of coefficients. But we may
regard the 20 items here as composing a small population, in which
there are no sampling errors. The indices of correlation appear to
perform in ways that illustrate expectations from known statistical
principles.


Amount of Restriction Effects in Terms of Ranges 


Although the obtained values in Table 2 yield results to be
expected, we can gain a better indication of the extent of restraints
imposed by marginal means by examining the range of possible values
in each case. In other words, what are the maximum and minimum
values for each index, given the particular combinations of person
means? For any pair of marginal means it is possible to rearrange
the obtained item scores for each person so as to minimize or to
maximize the index value. The results for the five indices are given
in Table 3.


TABLE 2

Intercorrelations of Four Persons Using Different
Indices of Correlation or Agreement

Variables                  Type of Index*
Correlated        φ     r_t     r_u       I     r_s
A with B        .58     .82     .89     .86     .70
A with C       −.20    −.34    −.59    −.49    −.36
A with D       −.13    −.22    −.45    −.38    −.06
B with C        .00     .00    −.45    −.38    −.25
B with D       −.22    −.38    −.59    −.41    −.10
C with D        .38     .60     .71     .25    −.20

* φ = product-moment correlation of obtained scores.
  r_t = tetrachoric r, from Thurstone diagrams.
  r_u = correlation based on proportion of unlike-signed cases.
  I = Winer's index of agreement.
  r_s = product-moment r based on standard scores.


² The tetrachoric coefficients were estimated by means of the Thurstone
diagrams. Cosine-pi estimates were usually within .02 of the graphic
estimates.


TABLE 3

Minimum and Maximum Possible Values for the Various Indices of
Correlation, Given Certain Combinations of Person Means

Variables                            Type of Index*
Correlated         φ            r_t           r_u            I            r_s
              Min.   Max.   Min.   Max.   Min.   Max.   Min.   Max.   Min.   Max.
A with B      −.29   +.87    −.8   +1.0   +.16   +.99   +.57   +.97   −.03   +.85
A with C     −1.00   +.33   −1.0    +.8  −1.00    .00   −.71    .00  −1.00   +.24
A with D      −.88   +.38   −1.0    +.8   −.99   −.16   −.65   +.32   −.32   +.64
B with C      −.87   +.29   −1.0    +.8   −.99   −.16   −.65   −.18   −.28   +.74
B with D      −.76   +.33   −1.0    +.8   −.95    .00   −.71   +.25   −.28   +.59
C with D      −.38   +.88    −.8   +1.0   −.16   +.99   −.71   +.87   −.72   +.92

* All indices defined the same as in Table 2.

There has been a common recognition that the maximal φ for
any combination of means decreases as degree of eccentricity
increases. In the context of the problem under discussion, there should
also be concern regarding the minimum φ. The ranges of φ given in
the table show that in each case φ has both positive and negative
values, but that sometimes a maximum or a minimum φ is not far
from zero.
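The limits of φ for a given pair of person means can be found without actually rearranging scores, since with fixed marginal totals φ depends only on the number of plus-plus agreements. The sketch below is a minimal illustration; the function name is hypothetical, and k_1 and k_2 stand for the numbers of unit scores of the two persons on n items.

```python
import math


def phi_bounds(n, k1, k2):
    """Minimum and maximum phi for two persons having k1 and k2 unit
    scores on n items (both counts strictly between 0 and n)."""
    def phi_for(a):
        # With fixed marginals, a*d - b*c reduces to a*n - k1*k2.
        return (a * n - k1 * k2) / math.sqrt(k1 * (n - k1) * k2 * (n - k2))

    a_min = max(0, k1 + k2 - n)   # fewest possible plus-plus agreements
    a_max = min(k1, k2)           # most possible plus-plus agreements
    return phi_for(a_min), phi_for(a_max)


# Unit-score counts on 20 items: A = 15, B = 16, C = 5, D = 6.
for label, k1, k2 in [("A with B", 15, 16), ("A with C", 15, 5),
                      ("A with D", 15, 6), ("B with C", 16, 5),
                      ("B with D", 16, 6), ("C with D", 5, 6)]:
    low, high = phi_bounds(20, k1, k2)
    print(f"{label}: {low:+.2f} to {high:+.2f}")
```

For the six pairs of persons this gives the minimum and maximum φ values entered in Table 3.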

Sometimes investigators have used the ratio φ/φ_max instead of
φ in factor analyses. Application of this ratio here would simply
result in the more extreme values of .67, −.60, −.34, .00, −.67, and
.43, respectively. It is clear that the dependence of the φ/φ_max ratio
upon eccentricity is only exaggerated. Incidentally, it can be
suggested that if such a ratio is to be used, for negative correlations the
ratio φ/φ_min should be substituted for φ/φ_max, where φ_min is also
negative.

The limiting values of r_t are found to be −0.8 and +1.0 in some
cases and −1.0 and +0.8 in others. If the cosine-pi approximation
to r_t were used, the range would be −1.0 to +1.0 in all instances,
since there was a zero in one cell of every rearranged contingency
table. Since it is easy to obtain a zero in one cell by optimal
rearrangement of cases within cells, these ranges are highly deceptive;
they suggest greater freedom from effects of eccentricity for r_t than
is justified. Actually in such unusual contingency tables, one could
well question the legitimacy of applying r_t at all. Some of these tables
suggest lack of genuine linearity of relationship and others suggest
lack of homoscedasticity.

The ranges of r_u that are possible in the six correlations involved
are clearly dependent on the combinations of person means. In five
cases the ranges are on one side of zero and in the sixth case the
range is almost as restricted.

Index I has the general lower limit of −.707, as stated before,
regardless of degree of eccentricity. This limit is reached in three
instances in Table 3. In three instances the ranges are limited to
one side of zero.

The limits of r_s, the correlation from standard scores, are not easy
to estimate. After interchanging a certain number of scores, as in
the case of the other indices, we also have to consider which scores
should be interchanged to accomplish either a minimum or maximum.
The limiting values given in Table 3 for r_s may not be the
most extreme possible correlations. Even so, the ranges are
surprisingly small. Further exploration needs to be done as to the nature
of the restriction that remains and whether it is descriptive of
something we want to describe.


Discussion 

One way to use dichotomous scores and still avoid problems arising
from eccentricity, of course, would be to select items whose
means have similar limitations. Unfortunately, items that we want
to use persist in having different means. Individuals also will have
different means. Restricting ourselves to items and persons with
means near .5, for example, would mean an intolerable loss of
otherwise usable material.

Although the index r_s involves much more work than do those
indices based on scores of 1 and 0, computational aids make its use
feasible. As stated, standard scores for an item can easily be found
in tables, and the two z scores for an item remain the same over all
persons. After all persons are assigned z scores for all items, the
labor is no greater than for any score matrix from which an ordinary
r is computed.

Even with the use of this correlational statistic, it would be well
to avoid using items with means very near zero or one, in which
case one of the z scores is relatively extreme. With these various
considerations, the proposed coefficient r_s would seem to have a basis
for priority in Q analysis.

One remaining uncertainty pertains to the shape of the distribu- 
tion of z scores for persons. Distributions should be inspected for 
cases of extreme skewing or truncation and for bimodality. Under 
these conditions one might dichotomize the distribution of z scores
for each person. Having a continuous scale of scores for each person, 
it should be relatively easy to dichotomize the distributions close to 
their medians, in which case the familiar indices of correlation 
would apply. 

Earlier it was mentioned that eccentricity is also a problem in 
R-technique analysis of items. Transformation of obtained scores to 
standard scores would not be the solution, for this would only change 
means and standard deviations of items. Items would intercorrelate
just the same regardless of such transformations. Going even further, 
i.e., transforming scores so as to equate means and standard devia- 
tions of individuals also, thus producing a doubly-centered matrix, 
would mean that we use ipsative scores in an R analysis. The “fac- 
tors” thus obtained would not be the same as those obtained from 
normative scores. 

Some other kind of solution is therefore needed to take care of 
the eccentricity problem in an R analysis of item scores. Earlier it 
was suggested that only items giving minimal degrees of eccentricity 
could be used, if one can afford to be so selective with regard to 
items. As a substitute for item scores, variates can be derived in 
the form of scores from small combinations, each composed of 
homogeneous items. The distributions of such scores for a variate 
can often be dichotomized, if necessary, within the neighborhood of 
the medians (for example, see Guilford & Zimmerman, 1956). 

Procedures for controlling marginal frequencies in responses to 
single items have been proposed recently by Willis (1960). Subjects 
respond by marking on a line. From such information one can cut 
distributions of item scores close to medians, either for items or for 
persons. In the latter instance, ipsative scores would be approached,
if not achieved.³


³ The article by Willis came to the author's attention after this paper was
written.


Summary 


A Q factor analysis based on items scored 1 and 0 encounters two 
difficulties. One is that eccentricity (differences in pairs of item
means or in pairs of person means) is a source of serious biases in 
indices of correlation. The other is that obtained item scores of 1 
and 0 are normative, where ipsative scores are needed in a Q 
analysis. 

The latter difficulty is readily met by transforming the obtained
scores into corresponding standard scores. Some formulas are
presented for this purpose. It is suggested that the correlation of
standard scores for persons also helps to solve the problem of biased
coefficients.

With the aid of some illustrative data, the effects of eccentricity 
upon various utilized indices of correlation were demonstrated. The 
new statistic r_s compares very favorably with the more familiar
correlation coefficients, and can be recommended in Q analysis,
where distributions of persons’ standard scores are sufficiently
regular.


REFERENCES 


Cheshire, L., Saffir, M., and Thurstone, L. L. Computing Diagrams
for the Tetrachoric Correlation Coefficient. Chicago: University
of Chicago Bookstore, 1938.

Guilford, J. P. Psychometric Methods. (Second Edition) New York:
McGraw-Hill, 1954.

Guilford, J. P. Fundamental Statistics in Psychology and Education.
(Third Edition) New York: McGraw-Hill, 1956.

Guilford, J. P. and Zimmerman, W. S. “Fourteen Dimensions of
Temperament.” Psychological Monographs, LXX (1956), Whole
No. 417.

Palmer, G. A., Jr. and McCormick, E. J. “A Factor Analysis of Job
Activities.” Journal of Applied Psychology, XLV (1961), 289-294.

Willis, R. H. “Manipulation of Item Marginal Frequencies by Means
of Multiple-Response Items.” Psychological Review, LXVII
(1960), 32-50.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


MEASUREMENT OF GAIN IN LEADERSHIP TRAINING¹


GEORGE DOUGLAS MAYO 
Naval Air Technical Training Command 


AND 
PHILIP H. DuBOIS 
Washington University 


One of the basic problems in both educational and psychological
research is the measurement of change in human behavior. In many
instances adequate measures of initial status and final status are
difficult to obtain. But even when these measures are available,
increments derived by subtracting initial score from current or final
score generally have proved unsatisfactory (Brookover, 1945;
Woodrow, 1946). Among the difficulties associated with this type of gain
is its tendency to correlate approximately zero with variables with
which an adequate measure of gain logically should correlate
significantly. Further difficulties arise from the fact that a test used as a
measure of initial status frequently is not appropriate as a measure
of final status, and when a different test is used, scores are often
expressed in a different metric. In addition, the negative correlations
usually found between gain and initial status are suggestive of a
statistical artifact.

DuBois (1957) has described a gain measure which purports to 
correct some of these criticisms. It is called “residual gain” since it 
is the residual which remains when the variance, which the initial 
score has in common with the final score, is partialed out of the final 
score. Manning (1961) accords Thorndike, Bregman, Tilton, and
Woodyard (1928) credit for the first research employing residual
gain.

1 The statistical methods employed in the study were formulated in part
under ONR Contract No. Nonr 816(02). The study proper was sponsored by
the Chief of Naval Air Technical Training. The opinions expressed are
those of the writers and are not necessarily shared by the Department
of the Navy.

The following favorable characteristics have been attributed to
residual gain. It systematically correlates better with other
variables, including aptitude and intelligence tests, than does the
algebraic difference between initial and final measures. Previous
research in which inconclusive results were reported gives evidence
of significant relationships when the residual gain criterion is
used. Use of residual gain imposes no requirement that the initial
and final measures be expressed in the same metric. When residual
gains in several tasks are correlated, a general factor tends to
emerge, a condition which agrees with logical considerations.
Finally, residual gain constitutes an operationally defined measure
of gain which correlates zero with initial status.
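The defining property of residual gain, its zero correlation with initial status, can be illustrated with a short computation. The sketch below is not the authors' procedure; it assumes an ordinary least-squares regression of final score on initial score, and the function names and the six pairs of ratings are hypothetical values chosen only for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def residual_gain(initial, final):
    """Final score minus its least-squares prediction from initial score."""
    mi, mf = mean(initial), mean(final)
    slope = (sum((x - mi) * (y - mf) for x, y in zip(initial, final))
             / sum((x - mi) ** 2 for x in initial))
    return [(y - mf) - slope * (x - mi) for x, y in zip(initial, final)]


# Hypothetical ratings for six men (not data from the study).
initial = [4, 6, 7, 9, 10, 12]
final = [5, 8, 7, 11, 10, 13]

gain = residual_gain(initial, final)          # residual gain
raw_gain = [y - x for x, y in zip(initial, final)]  # algebraic difference

print("r(initial, residual gain):", round(pearson(initial, gain), 6))
print("r(initial, raw gain):     ", round(pearson(initial, raw_gain), 6))
```

Because the residual is taken about the regression line, multiplying the final scores by a constant multiplies every residual by the same constant, which is why the initial and final measures need not share a metric.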

To date, encouraging results have been achieved from the use of
residual gain as a criterion in test validation (Manning & DuBois,
1958), and similar statistical procedures, involving multiple-partial
correlation, have achieved a degree of success in the study of
motivation measurement (Mayo & Manning, 1961). The present study
undertakes, by means of residual gain and related correlational
procedures, to measure gain in leadership performance.


Procedure 


The Naval Air Training Command conducts a leadership school 
for chief petty officers assigned to duty in the 48 training or support 
activities in the command. The five-week course enrolls a new class
of approximately 60 chief petty officers every six weeks. Special em- 
phasis is placed upon military leadership in terms of personal ex- 
ample, moral responsibility, and good management practices. In the 
formulation of the curriculum it was necessary to make a decision 
as to whether the human relations aspect of leadership or the mili- 
tary aspect of leadership would be emphasized more. The decision 
was made in favor of the latter. This resulted in the “atmosphere” 
of the school being much more like that of a strict military academy 
than like that of the usual human relations type of training found 
in industry. 

Six consecutive classes were involved in the study. The first en- 
tered training in April, 1960, and the last class was graduated in 
December, 1960. Of the approximately 360 chief petty officers who
were assigned to the school during this period, complete initial and
follow-up data were obtained for 211. The design of the study called 
for an evaluation of the leadership performance of each graduate of 
the school by the commissioned officer who supervised his work prior 
to assignment to the school and after he had been back on the job 
for a period of two months. Most of the loss in numbers from 360
to 211 resulted either from the supervising officer being transferred 
to a new duty station or from the chief petty officer being assigned 
to new duties upon returning to his original duty station. A careful 
review of the selection that took place in this reduction in N gave no 
indication that the selection was related to leadership performance. 

At the time this information was collected, each supervising officer 
was asked to provide the same information concerning a chief petty 
officer who did not attend the Chief Petty Officer Leadership School. 
These were to act as a control group for comparison with the chief 
petty officers who attended the school. In order to make the evalua- 
tions as objective as possible, a special form was devised for use by 
the supervising officers. It consisted of a 15-point scale with two 
anchor or reference points. The first reference point was located at 
the fourth point on the scale. The directions required that the name 
of one of the most outstanding chief petty officers, in terms of lead- 
ership, the supervising officer had known during the past five years 
be entered on a line provided at this point. In similar fashion at the 
twelfth point of the 15-point scale, the supervising officer was asked 
to enter the name of a chief petty officer whom he had known during 
the past five years who was just barely adequate as a leader. For 
purposes of the study, effective leadership behavior was defined as 
“effectively organizing, directing, and obtaining the cooperation of 
others in an atmosphere of mutual trust, respect, and moral responsi- 
bility.” 

Having established the two anchor or reference points just
described, the supervising officers were then asked to evaluate the
leadership performance of the man who was assigned to the Chief 
Petty Officer Leadership School in comparison with the two men 
whom they had identified as representative of the two defined levels 
of leadership. The same procedure was followed on a separate copy 
of the form for the man in the control group. The same chief petty 
officers were used as reference points on the two copies of the form, 
both before and after training. Copies of the form were not provided
for retention by the supervising officer, as it was desired that the
second evaluation, conducted two months after graduation, be an
independent evaluation of performance. Both evaluations were
accomplished by direct, official correspondence between the Chief of
Naval Air Technical Training and the supervising officer.

The two evaluations of the control group, approximately 13 weeks
apart, afforded an opportunity to obtain a test-retest estimate of
the reliability of the supervising officer evaluations. This
reliability coefficient was .78. Inspection of the means and standard
deviations of the members of the experimental group and the control
group showed the two groups to be highly comparable.

The measure used as the criterion of gain in leadership performance
was the residual which resulted from partialing, from the second
evaluation made by the supervising officer, the variance in common
with the first evaluation. Stated differently, the residual gain
criterion consisted of the part of the supervising officer's
evaluation made after the graduate of the school had been back on the
job for two months, which was linearly unrelated to the evaluation
made at the time the student entered the school.

In addition to the initial and final supervising officer evaluations
for each graduate of the school, certain other measures were also
considered. One of these measures was peer ratings on gain while
in the school. Classmates were asked to carefully consider both
initial status and present level of leadership knowledge and skill of
the members of their platoon, approximately 20 men, and then
write the names of the three men whom they believed had gained
the most from the school. It was pointed out that, since all students
do not start at the same level, it is possible for a man who is not at
the top of the class to gain more from the school than a man who is
at the top of the class. The peer ratings were collected during the
last week of the course. A second measure of gain was based upon
a self-rating measure, which was included in a questionnaire sent
to each of the graduates of the school at the time the second
evaluation was made by the supervising officer. This item asked, “To
what extent has your leadership behavior changed in a positive or
favorable way as a result of attending the school?” A five-point
scale was provided ranging from the term “markedly” to “none.”

A third measure of gain was derived from a written test covering
the content of the course and the application of leadership
principles. This test of 100 items, given upon reporting to the
school and again at the end of the school, will be referred to as the
pre-test and the end-test. The measure of gain was the residual which
remained when the variance the pre-test had in common with the
end-test was partialed out of the end-test variance. A fourth measure
of gain was derived in a similar manner by partialing the pre-test
variance out of the Chief Petty Officer Leadership School grade.

Other variables included in the study were the Navy General
Classification Test (GCT), which is essentially a verbal intelligence
test, years of civilian education, age, and final or over-all grade
made in the school. A final variable was a brief
authoritarian-equalitarian (A-E) measure consisting of seven items,
developed by Sanford and Older (1950). The test was taken during the
first few days the students were at the school and was administered
as an “opinion poll.” It was included largely on a trial basis to
gain an impression as to whether it would be worthwhile to conduct a
more comprehensive study relating authoritarianism to leadership in
the chief petty officer setting.


Results 


A principal question which the study sought to answer pertained
to the effectiveness of the Chief Petty Officer Leadership School with
respect to improving leadership performance on the job, and to the
correlates of gain in leadership behavior. The point biserial
correlation between the dichotomous variable of attendance versus
nonattendance at the school and the residual gain criterion was .34,
which with an N of 422 is significant well beyond the .01 level.
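The significance of a correlation of this size can be checked by converting it to a t ratio with N - 2 degrees of freedom. The sketch below is an illustration, not the authors' computation; the function names are hypothetical, and the two-sided probability uses a normal approximation, which is an assumption that is reasonable only because the degrees of freedom (420) are large.

```python
import math


def t_for_r(r, n):
    """t ratio for testing a correlation against zero, with n - 2 df."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def two_sided_p_normal(t):
    """Two-sided tail probability under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))


# Values reported in the text: r = .34, N = 422.
t = t_for_r(0.34, 422)
p = two_sided_p_normal(t)

print("t =", round(t, 2))
print("approximate two-sided p < .01:", p < 0.01)
```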

Intercorrelations among the variables involved in the study, to- 
gether with means and standard deviations, are shown in Table 1. 
Positive correlations indicate the association of high degrees of a 
characteristic. The means of certain variables were adjusted so that 
a higher score indicates a higher level of performance. In the case of 
the Authoritarian-Equalitarian (A-E) Scale, a high score indicates 
a more authoritarian attitude. 

The correlations in Table 1 were computed as a first step in
determining the correlation of the variables, and of certain
residuals, with the residual gain criterion. While the correlations
were generally low, 18 of the 50 coefficients were large enough to be
statistically significant at the .01 level of confidence, and five
more at the .05 level.


TABLE 1

Intercorrelations, Means, and Standard Deviations of Variates
(N = 211)

Variate                            1     2     3     4     5     6     7     8     9    10
 0. Second Supervising Officer
    Evaluation                   .03  -.03   .06   .08  -.09   .13  -.14   .16   .36    .7
 1. Education                          .13   .13   .34  -.19   .08   .01   .06   .23   .04
 2. Pre-test                                 .18   .46  -.28  -.08  -.03  -.17   .33   .03
 3. End-test                                       .26  -.19   .03  -.13  -.19   .40   .12
 4. GCT                                                 -.37  -.06  -.11   .02   .32   .02
 5. Peer Ratings                                               .08   .07   .03  -.37  -.08
 6. Self-Ratings                                                    -.08   .10  -.09   .01
 7. A-E Scale                                                             -.06  -.18  -.16
 8. Age                                                                          -.3   .16
 9. Leadership School Grade                                                            .21
10. First Supervising Officer
    Evaluation

Note: Correlation of .14 is required for .05 confidence level; .18 for .01 confidence level.

Table 2 shows the correlations of residual gain with the other four
gain measures. The highest correlation in this table is the partial
correlation between residual gain and the Chief Petty Officer
Leadership School grade residual. This latter residual was the result
of partialing the variance associated with the pre-test out of the
Chief Petty Officer Leadership School grade (X9). The correlation was
.20, which is statistically significant at the .05 level. The part
correlation of residual gain and self-ratings on gain in leadership
behavior, made by the chief petty officers themselves after they had
been back on the job for two months, was .17, also statistically
significant at the .05 level. The correlation with peer ratings on
gain in the Chief Petty Officer Leadership School was -.05, while the
correlation with the remaining gain measure, the end-test with the
variance associated with the pre-test partialed out, was also
essentially zero.
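The distinction between a part correlation and a partial correlation may be made concrete with the usual first-order formulas. The sketch below assumes those textbook formulas rather than the authors' own worksheets: in a part correlation the third variable is removed from only one of the two measures, and in a partial correlation it is removed from both. The function names and the three input coefficients are hypothetical and are not taken from Table 1.

```python
import math


def part_corr(r_xy, r_xz, r_yz):
    """Correlation of x with the residual of y after z is removed from y only."""
    return (r_xy - r_xz * r_yz) / math.sqrt(1.0 - r_yz ** 2)


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after z is removed from both."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Hypothetical coefficients: x = a gain rating, y = final evaluation,
# z = initial evaluation.
r_xy, r_xz, r_yz = 0.30, 0.10, 0.60

part = part_corr(r_xy, r_xz, r_yz)
partial = partial_corr(r_xy, r_xz, r_yz)

print("part correlation:   ", round(part, 3))
print("partial correlation:", round(partial, 3))
```

When x is nearly unrelated to z, as in this example, the two coefficients differ only slightly; the partial correlation is never smaller in absolute value than the part correlation.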

Two additional statistical steps were taken, neither of which
provided any information of special interest. The first consisted of
computing the part correlations between residual gain and the
nongain variables, namely, education, pre-test, end-test, GCT, A-E
Scale, age, and leadership school grade. None of these correlations
was statistically significant, the highest being .16 in the case of
the leadership school grade. The second step consisted of computing
the partial correlations between residual gain and the other eight
variables with intelligence, as measured by the Navy General
Classification Test, partialed out of each one. None of these
correlations proved to be statistically significant.

Discussion 

The significant point biserial correlation between membership versus
nonmembership in the leadership training group on the one hand 
and residual gain in supervising officers’ evaluations on the other is 
interpreted as evidence of the effectiveness of the Chief Petty Officer 
Leadership School. This finding is contrary to the usual finding of 
no significant difference in on-the-job performance following human 
relations training in industry as, for example, in the work of Harris 
and Fleishman (1955). It is difficult to account with confidence for 
the difference in the two findings, but it may well be that the five- 
week leadership course in the present study was more comprehensive 
than that used in the study conducted by Harris and Fleishman, 
which was two weeks in length. Another important difference doubt- 
less was the military setting of the present study as opposed to the 
civilian setting of the earlier study. 

TABLE 2

Correlation of Residual Gain (X0.10) with Other Gain Measures
(N = 211)

                   Peer       Self-      End-test    School Grade
                   Ratings    Ratings    Residual    Residual
                   X5         X6         X3.2        X9.2
Residual Gain      -.05       .17*       ...         .20*

* Statistically significant at .05 level of confidence.


One might like to attribute the positive results shown in the
present study to more refined measures, such as the residual gain
criterion. This may be true to some extent, but the positive results
may not be attributed entirely to the procedures used, since it was
ascertained that a simple comparison of the number of subjects in
the control group and in the experimental group whose leadership
behavior improved on the second evaluation, remained the same, or
declined resulted in a statistically significant difference also. The
chi square value of the comparison of these two sets of figures was
50.50. With two degrees of freedom, as is the case here, a chi square
of 9.21 is statistically significant at the .01 level. When converted
to a contingency coefficient this chi square value yields a C of .33.
Therefore, with unrefined statistical procedures a significant
difference still exists between the group that had had leadership
training and the group that had not.
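The conversion from chi square to a contingency coefficient can be reproduced with the usual formula, C = sqrt(chi square / (chi square + N)). The sketch below is an illustration under two assumptions: that N is the combined 422 men of the experimental and control groups, and that the .01 critical value may be taken from the closed form of the chi square distribution for two degrees of freedom, whose upper-tail probability is exp(-x / 2). The function names are hypothetical.

```python
import math


def contingency_coefficient(chi_square, n):
    """Pearson's contingency coefficient C for a chi square based on n cases."""
    return math.sqrt(chi_square / (chi_square + n))


def chi_square_critical_df2(alpha):
    """Critical value for 2 degrees of freedom, where P(X > x) = exp(-x / 2)."""
    return -2.0 * math.log(alpha)


# Reported chi square of 50.50; N assumed to be 211 + 211 = 422.
c = contingency_coefficient(50.50, 422)
critical = chi_square_critical_df2(0.01)

print("C =", round(c, 2))
print("critical chi square (.01, 2 df) =", round(critical, 2))
```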

It is true of course that the commissioned officer who evaluated 
the leadership performance of the experimental and control groups 
knew which men attended the school and which ones did not. It is 
pointed out, however, that the supervising officers were not asked 
whether or not the man whose leadership behavior they were evalu- 
ating on the second evaluation had improved. They were asked only 
to compare it with that of the two chief petty officers used as refer- 
ence points and to evaluate accordingly. Further, it is doubtful that 
the supervising officers retained a record of their first evaluation 
since no file copies of the form were provided. If some of the super- 
vising officers remembered the previous marks and, contrary to the 
directions, used this information instead of evaluating the man’s 
leadership behavior as described on the form, it is a matter which 
unfortunately could not be controlled. 

It may be significant that the only two variables which correlated 
significantly with residual gain were themselves gain measures, 
namely, the self-ratings made by graduates of the school after they 
had been back on the job for a period of two months, and grades
assigned by the school with initial standing as measured by the
pre-test partialed out. Inasmuch as peer ratings have been effective
in predicting on-the-job performance in a wide variety of military
situations, the low correlation of peer ratings with residual gain was not
anticipated. In retrospect it appears that the chief petty officers 
tended to nominate members of their platoon who had low initial 
standing with the thought that these men had gained more from the 
school than the members of their platoon who started at a high 
level. The significant negative correlations between peer ratings and 
the Navy General Classification Test, education, pre-test, end-test, 
and leadership school grade suggest that this was the case. It may 
be that this handicap was too great for the men nominated by their 
peers to overcome. In any event, peer ratings on gain in the Chief 
Petty Officer Leadership School were not effective in predicting the 
residual gain criterion. 




Summary 

Residual gain and related correlational procedures were applied
to the measurement of gain in leadership performance following a
five-week course in naval leadership. The study involved an
experimental group of 211 chief petty officers who received the leadership
training course and 211 chief petty officers who did not receive the 
training. Significant differences were found between the leadership 
performance of the two groups on the job, as observed by supervis- 
ing commissioned officers, two months after the experimental group 
returned to their original duty stations. The two measures which 
related most closely to residual gain were also measures of gain, 
namely, self-evaluation of gain in leadership behavior made by the 
chief petty officers themselves after they had been back on the job 
for two months and the grade assigned the graduates by the Chief 
Petty Officer Leadership School with initial status as measured by 
a pre-test partialed out. Experience with residual gain in the context 
of leadership measurement suggests that it has much to recommend 
it when one wishes to measure gain in educational or psychological 
work. 


REFERENCES


Brookover, W. B. “The Relation of Social Factors to Teaching
Ability.” Journal of Experimental Education, XIII (1945), 191-

DuBois, P. H. Multivariate Correlational Analysis. New York:
Harper & Brothers, 1957.

Harris, E. F. and Fleishman, E. A. “Human Relations Training and
the Stability of Leadership Patterns.” Journal of Applied
Psychology, XXXIX (1955), 20-25.

Manning, W. H. “Antecedents of Part Correlation in Research on
Learning.” In DuBois, P. H. and Manning, W. H. (Eds.), Methods
of Research in Technical Training (Revised Edition), Office
of Naval Research Technical Report No. 3, Contract Number
816(02), Washington University, April, 1961.

Manning, W. H. and DuBois, P. H. “Gain in Proficiency as a
Criterion in Test Validation.” Journal of Applied Psychology, XLII
(1958), 191-194.

Mayo, G. D. and Manning, W. H. “Motivation Measurement.”
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, XXI (1961),
73-83.

Sanford, F. H. Authoritarianism and Leadership. Philadelphia:
Institute for Research in Human Relations, 1950.

Thorndike, E. L., Bregman, J., Tilton, J. W., and Woodyard, E.
Adult Learning. New York: Macmillan Company, 1928.

Woodrow, H. “The Ability to Learn.” Psychological Review, LIII
(1946), 147-158.




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


DEVIANT RESPONSE TENDENCIES:
THEIR MEASUREMENT AND INTERPRETATION¹


LEE SECHREST 
Northwestern University 
AND 
DOUGLAS N. JACKSON 
Pennsylvania State University 


In recent years a wide variety of systematic biases in the response
tendencies of individuals have been identified. Few of these, how- 
ever, are of greater potential theoretical or practical importance than 
the one described by Berg (1955), accounted for in terms of his 
“deviation hypothesis,” and further elaborated in a number of re- 
search studies (Adams & Berg, 1961; Barnes, 1955; Berg, 1957, 1959, 
1961; Berg & Collier, 1953; Hesterly & Berg, 1958; Roitzsch & Berg, 
1959). 

The initial statement of the deviation hypothesis was: “Deviant
response patterns tend to be general; hence those deviant behavior
patterns which are significant for abnormality and thus regarded as
symptoms are associated with other deviant response patterns which
are in noncritical areas of behavior and which are not regarded as
symptoms of personality aberration” (Berg, 1955, p. 62).

It is our purpose to attempt a clarification and explication of the
measurement and interpretation of deviant response tendencies. We
have gained the impression from a study of published reports and
from our own research on deviant response tendencies that certain
conceptions in this area, not necessarily attributable to Berg, permit
a variety of interpretations, some of which are, in our view, mutually
exclusive or otherwise untenable. Here as elsewhere in personality
assessment, measurement procedures are intimately related to
questions of interpretation; alternative indices of deviant tendencies
may result in very different scores for particular individuals, and
hence will have varying theoretical implications. By explicitly
delineating alternative interpretations and measurement operations,
we hope to provide a sounder basis for an appraisal of the conditions
under which various interpretations of the deviation hypothesis are
and are not supported by empirical observations. In this venture
we shall discuss first the generality assumption, and then take up the
question of alternative conceptual and measurement definitions of
deviant response tendencies.

1 Based on a symposium paper given at APA, 1960. Supported in part by the
National Institute of Mental Health, Public Health Service, under Research
Grant M-2757 to Northwestern University and M-2738 to Pennsylvania State
University. The authors thank Judith Schmerling for assistance and Samuel
Messick for a careful reading of the manuscript.


The Generality of Deviation 


It has been suggested (Berg, 1959) that psychotics, lawyers, cardiac
patients, transvestites, young normal children, character disorders,
the obese, the feebleminded, psychoneurotics and persons
suffering from constipation, among others, represent deviant groups
which might be expected to manifest their particular propensities
toward deviation not only in a modality relevant to their particular
symptoms and to items with relevant content, but also in response
to one or more of the following: preference for abstract drawings,
food aversion questionnaires, stimuli for conditioned responses,
autokinetic and spiral aftereffect situations, vocabulary test items,
figure drawings, musical sounds, and olfactory stimuli. Berg (1959,
p. 95) has stated: “Indeed, any content which produces deviant
response patterns will serve, judging from the available evidence
... Accordingly, for personality and similar tests, a particular item
content is unimportant.”

We would like to examine the generality assumption of the
deviation hypothesis and what would appear to be its corollary, the
alleged unimportance of item content. If it is proposed that deviant
response tendencies are wholly general, that is, that an individual who
is deviant on any one measure is likely to be deviant on all other
measures, the deviation hypothesis has some rather overwhelming
implications. One is naturally suspicious of such universal rules in
psychology, having learned through hard experience that psychological
processes are usually more complicated than the initial
optimistic attempts at explanation would suggest (cf. Cronbach,
1958). On the other hand, it might be objected that what is actually
being proposed is only that deviations in critical areas, for example,
psychopathological symptoms, are likely to be associated with
deviations in certain noncritical areas. Nevertheless, if one supposes
that some general factor of deviation tendencies accounts for the
association between deviations in critical and noncritical areas,
then deviations in noncritical areas might be expected to be
associated, if all variables, critical and noncritical, shared a mutual
relation to a common factor of general deviation. The alternative, of
course, is a limitation on the deviation hypothesis, a restatement
in a probably more accurate but unfortunately less interesting form;
namely, “Deviant response tendencies are sometimes associated.”
If so stated, the task of the researcher then becomes one of
determining the limiting conditions under which the deviation hypothesis
might be said to hold. The specific ways in which people are
deviant then become of central importance in appraising the
generality of the deviation hypothesis. If specific deviant responses to
particular personality items show significant discriminant validity
for identifying certain psychopathological symptoms, then it is
reasonable to inquire into the properties, even the content, of these
particular differentiating items.

There is, of course, the danger of lapsing into triviality in
proposing that deviant response tendencies are less than completely
general. Probably there are few, if any, persons who are not
deviant in some manner or other (none, if we read Berg correctly) and
it would necessitate little courage to stipulate that persons who were
deviant in one way would be found to be deviant in some other way.
At the least one must answer, if not in all other ways, then in what
other ways?

Even though not directly suggested by Berg, it might naturally
occur to one that deviant response tendencies should be associated
even in noncritical areas. If deviation tendencies are general in
the sense that persons tend to perform in a deviant manner in a
variety of situations, it would be useful to appraise the extent to which
normal subjects showed deviant behavior over varying classes of
responses. It should be noted that the generality assumption would
not necessarily require that correlations be found between
independent measures of deviant responding, although the correlational model
is conceptually simpler and thus, perhaps, preferable. Relationships
between various deviation measures might be considered lawful if
subjects deviant in one direction on one measure were deviant in any
direction on a second measure. For example, if there were a
relationship between deviant height and deviant intelligence, it might
be found that deviantly tall subjects were either brighter or duller
than normally tall subjects and that similar findings obtained for
the deviantly diminutive. Obviously in such a case the
height-intelligence correlation might well be zero.
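The height-intelligence example can be made concrete with a small constructed data set. In the sketch below, which is an illustration only, every value is hypothetical: each height deviation is paired once with an intelligence deviation of the same size in the positive direction and once in the negative direction. The signed scores then correlate zero, while the absolute deviations, which represent the nondirectional approach, correlate perfectly.

```python
import math


def pearson(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical deviation scores (in standard units) for ten subjects.
height, intelligence = [], []
for d in [-2, -1, 0, 1, 2]:
    for sign in (1, -1):
        height.append(d)
        intelligence.append(sign * abs(d))  # brighter or duller by the same amount

directional = pearson(height, intelligence)
nondirectional = pearson([abs(h) for h in height],
                         [abs(i) for i in intelligence])

print("correlation of signed scores:       ", round(directional, 3))
print("correlation of absolute deviations: ", round(nondirectional, 3))
```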

In an empirical evaluation of the generality of deviant response 
tendencies, the authors (Sechrest and Jackson, 1962) followed both 
a correlational and a nondirectional approach to the analysis of de- 
viant responding. Very little evidence appeared in the data in support 
of the generality of “noncritical” deviant responses in normal sub- 
jects. 

Of course, such a finding of broad generality would indeed be
surprising. A generality hypothesis in one of its simplest forms— 
that deviation is assumed to be unidirectional and associated across 
diverse measures—would require all variables to be correlated in 
some degree. Numerous unpublished doctoral dissertations disconfirm 
such a naive and somewhat absurd one-dimensional formulation. It 
should be noted that Berg (personal communication, 1960) explicitly 
disavows this interpretation of the deviation hypothesis, although
certain of his writings (e.g., Berg, 1959) might lead a reader to conclude
otherwise.

In the light of the paucity of evidence that deviant response
tendencies are broadly general, and especially in the light of the
compelling logical reasons that they cannot all be correlated, we shall
examine the alternative, that is, that those individuals who are
deviant on one measure, or in one way, will be deviant in some other
way. We have already pointed out that there is the danger of
triviality in such a statement, but triviality may be avoided if we are
clear in a specification of the limitations we are imposing on the
deviation hypothesis. We also believe, however, that when we begin to
examine the logical status of a limited deviation hypothesis, there are
some considerations, some already mentioned above, which seem
definitely to contradict the statement that “particular stimulus
content is unimportant for measuring behaviors in terms of the
Deviation Hypothesis” (Berg, 1959). For it becomes apparent that it is
not true that just any old content will suffice, not even content which
yields deviant responses. Undoubtedly there are ways, for example,
in which schizophrenics might not differ from normals, e.g., in
cigarette preferences, even though there are many cigarette preferences
which are deviant. Once we begin to specify that not all deviant
response measures are equally good, content makes the scene.

As a matter of fact, Berg has apparently never really gone beyond
the statement that deviant response tendencies are general in
denotating the Deviation Hypothesis. He has never really proposed or
attempted to demonstrate generality of deviant response tendencies
within groups. His research on the deviation hypothesis has centered
on the study of various “criterion” groups which may be supposed to
be deviant in some “critical” way. Such criterion groups are contrasted
most frequently with normal groups but occasionally with each other
for evidence of deviation in noncritical modes. It has never been
made clear, however, just what constitutes a “critical” deviation. Why
is it assumed that lawyers, cardiac patients, young normal children,
and sufferers from constipation are critically deviant? What would
not constitute, then, a critical deviation, and who would not be the
unproud possessor of one?


Response Styles and Deviation 


One line of interpretation of deviant responses favored by Berg
(cf. 1959, 1961) is to note first that there are certain modal response
preferences to personality questionnaires and other devices in the
general population from which certain defined criterion groups may
be found to depart to a significant degree. The emphasis here is not
upon a deviant response to a particular category of content, e.g., a
schizophrenic responding “true” to the item, “I hear voices when
there is no one around,” but rather upon some systematic pattern
of responding which differentiates normal subjects from the deviant
criterion group. These consistent tendencies, variously termed
“response sets” (Cronbach, 1946; 1950); “biased responses” (Barnes,
1955; 1956a; 1956b; Berg, 1955; 1961); “response style” (Jackson &
Messick, 1958; 1962) or “stylistic variance” (Wiggins, 1962a) have
been proposed as serving as the basis upon which criterion groups
may be differentiated from normal groups, even when the item
content is not “critical,” i.e., when there is no a priori basis for
interpreting the content as relevant to the considerations upon which the
criterion group was chosen.

Thus, it has been proposed by Barnes (1956a; 1956b) that neu-
rotics respond deviantly “false” and psychotics respond deviantly
“true” to MMPI items. This finding was interpreted by Jackson
and Messick (1958; 1962) and Wiggins (1962a) as indicating that 
in personality questionnaires acquiescence may be differentially elic- 
ited at varying levels of item desirability within various criterion 
groups, since it is known that response probability is a function of 
item desirability (cf. Edwards, 1957). 

If this were as far as Berg’s argument went, i.e., that there are cer- 
tain noncontent determinants of responses which might have validity 
for certain purposes, there would be little room for disagreement, 
for the evidence already accumulated is impressive, and the research 
of Berg and his students has substantially contributed to this litera- 
ture. However, Berg goes considerably beyond this point. Not only 
are we told that content is unimportant (Berg, 1959), but further- 
more that, because young, normal children and schizophrenics had 
a similar pattern of deviant responses on the Perceptual Reaction 
Test (PRT) (Berg, Hunt & Barnes, 1949) when contrasted with 
normal adults (Hesterly & Berg, 1958), the results were considered 
to support the notion that “schizophrenic responses are character-
ized by immaturity” (Berg, 1961, p. 352). We shall not dwell on the
question of by what rule of logic one may conclude that when A and
B differ in some particular respect from C, but not from each other,
they are necessarily alike in other aspects as well, or the question
of why it is not concluded by similar logic that young, normal chil-
dren are schizophrenic. Rather, the point to be emphasized is that,
because in most structured inventories the number of response al- 
ternatives is severely restricted (the PRT has only four alterna- 
tives), any deviant pattern is likely to be correlated with any other. 
To take a more extreme example for illustrative purposes, if there 
are only two response alternatives for an item, and one is keyed 
“normal,” then for a set of such items all “deviation” keys keyed
for every item must be perfectly correlated. The “generality” of
deviation under such conditions would be spurious, or, at best, in-
determinate.
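
The two-alternative argument above can be illustrated with a minimal Python sketch. It is an illustration only and not part of the analyses reported here; the function names, the five-item test length, and the three response records are hypothetical. It counts the keys that score exactly one non-normal alternative on every item: with two alternatives only one such key exists, so any two "different" deviation keys are the same key and must correlate perfectly.

```python
from itertools import product

def deviation_keys(n_alternatives, n_items, normal=0):
    """Enumerate keys that key exactly one non-normal alternative on every item."""
    deviant = [a for a in range(n_alternatives) if a != normal]
    return list(product(deviant, repeat=n_items))

def score(responses, key):
    """Number of items answered with the keyed deviant alternative."""
    return sum(r == k for r, k in zip(responses, key))

two_choice_keys = deviation_keys(n_alternatives=2, n_items=5)
four_choice_keys = deviation_keys(n_alternatives=4, n_items=5)

# Hypothetical response records for three subjects on five two-choice items
# (0 = the alternative keyed "normal", 1 = the only other alternative).
subjects = [(1, 0, 1, 1, 0), (0, 0, 0, 1, 0), (1, 1, 1, 1, 1)]
scores_per_key = [[score(s, key) for s in subjects] for key in two_choice_keys]

print(len(two_choice_keys), len(four_choice_keys))
print(scores_per_key)
```

With four alternatives the same enumeration yields many distinct keys, which is the only situation in which a correlation between keys carries any information.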

It is for precisely this reason that Jackson and Messick (1962a)
have proposed, with evidence, that in cases in which response alter-
natives are markedly limited, and where massive response style
effects are revealed, as in the MMPI, the opportunities for valid dif- 
ferential diagnoses on the basis of test scores are severely restricted, 
even though the test may distinguish pathological groups generally 
from normals. Barnes’ (1955) finding that pathological groups can
be distinguished from normals with a seven-minute PRT about as 
well as with the much longer MMPI is important evidence bearing 
on this point, but it need not necessarily be taken as evidence for 
the alleged “generality” of deviation, except perhaps insofar as 
pathological groups may generally show a higher rate of random re- 
sponding (cf. below). One approach to this dilemma, one followed 
out of necessity by Strong (1935) in the area of vocational interests 
after finding that professional groups appeared similar to one another 
when contrasted with “men in general,” and attempted with some 
success by Barnes (1955) with psychiatric patients, is to contrast
the responses of different “deviant” criterion groups, in an attempt to 
find discriminating items and keys. Unfortunately, Berg, in his
quest for evidence bearing on the “generality” of deviation, has
preferred contrasting various deviant groups with normals to this
more sensitive differential approach. For most purposes of assess-
ment, the latter approach will yield more useful and interpretable
information.


Definition of Deviation 


We believe that the idea of deviation or of being deviant might 
be considered and applied in quite a number of different ways, not 
all of which are equally important and not all of which are entirely 
compatible with each other. We would like to consider a bit more 
extensively the nature or definition of the deviant response itself, or 
perhaps we should say the nature of deviant responses, for, as
we hope to show, there are a variety of ways in which persons may 
contrive to be deviant on a single test or assessment device. For 
that matter, the determinants of the same deviant response are very 
likely different for different persons. 

We want to describe six definitions or sources of deviation. These
are: (a) absolute deviation; (b) relative deviation; (c) statistical
infrequency; (d) extremeness of traits; (e) unique structuring of
traits; and (f) randomness of response.


Absolute and Relative Deviations 


First of all, Berg has stated, “...we have defined a deviant re-
sponse as one which differs from the modal response or from a cri-
terion group response” (Berg, 1959). There are, in that statement,
two definitions of deviation. The first, which we prefer to call rela-
tive deviation, makes use of the responses which are atypical in the
group being studied. That is, the modal responses are identified, and
all other responses are termed deviant. The second, which we shall
call absolute deviation, requires the identification of some group
which may on a priori grounds be termed deviant. Those noncritical
responses which differentiate the a priori deviant group from a com-
parable normal group may be used to develop a deviant response
key. Barnes (1955) has developed a key on the PRT based on an
absolute definition of deviation. Barnes’ key, the Delta scale, was
devised to distinguish hospitalized psychiatric patients from normal
subjects. Sechrest and Jackson (1962) followed the relative keying
procedure and developed a key based on the modal and deviant re-
sponses of college and nursing students. The relative deviation key
yields scores which, for college students, correlate .70 with Delta
scores based on absolute deviation tendencies. Since, in that sample,
neither the Deviation nor the Delta key had Kuder-Richardson
Formula 21 reliabilities in excess of .70, the obtained correlation
would seem to indicate some substantial similarity between the
methods of developing keys. Apparently those responses which are
unusual in a normal population are, for the most part, those which
are typical of pathological groups. As we have pointed out above,
to a large extent this must be so because of the built-in limitations
upon response alternatives with resulting high correlations between
various deviant keys.
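
The reasoning from reliabilities to similarity of keys can be made concrete with a short Python sketch. It is offered as an illustration under stated assumptions, not as the computation actually performed: the Kuder-Richardson Formula 21 and the classical correction for attenuation are standard, but the item count, mean, and variance passed to `kr21` below are hypothetical, and the reliabilities are simply set at the ceiling of .70 mentioned in the text.

```python
import math

def kr21(k, mean, variance):
    """Kuder-Richardson Formula 21 from the number of items, and the mean and
    variance of total scores."""
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

def disattenuated(r_xy, r_xx, r_yy):
    """Classical correction for attenuation: the correlation expected between
    the two measures if both were perfectly reliable."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical summary statistics for a 10-item key.
example_reliability = kr21(k=10, mean=5.0, variance=6.25)

# Observed correlation of .70 between keys whose reliabilities are at most .70.
ceiling = math.sqrt(0.70 * 0.70)            # largest observed r the reliabilities allow
corrected = disattenuated(0.70, 0.70, 0.70)

print(round(example_reliability, 3), round(ceiling, 2), round(corrected, 2))
```

An observed correlation at the ceiling set by the reliabilities is what one would expect if the two keys ranked subjects in essentially the same way.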

There are some rather evident connotational differences between
a definition of deviation couched in terms of discrepancy from a
modal response pattern on the one hand and a definition in terms of
similarity to some deviant group on the other. In the case of sim-
ilarity to some deviant group as the defining property, there is at
least the implicit assumption of “critical” similarity for groups
which are alike in noncritical ways. Hesterly and Berg (1958), for
example, report that young normal children differ from normal
adults but do not differ from adult schizophrenics, with the
conclusion that young normal children and adult schizophrenics are
alike in their “immaturity.” On the other hand, to define deviation 
in terms of departure from the modal or typical pattern is to imply 
that to be deviant is to deviate from some particular group and that 
whether one is deviant at a given time depends on the context within 
which one is being evaluated. An Eskimo is a deviant in Chicago 
but not in the Yukon. A schizophrenic is deviant among us—but is 
he in a psychiatric hospital? We want to make it clear that we are 
not necessarily endorsing nor deprecating either definition of devia- 
tion. We simply believe that they are different and that the differ- 
ence should be recognized. 


Statistical Infrequency as Deviation 


There is another way in which the idea of deviation is employed, 
particularly with respect to the measurement operations involved, 
if not necessarily in connotative meaning. Very often, it appears, 
the term deviant is applied to the statistically infrequent event or 
object without respect to the larger context in which it appears. We 
believe that the criterion of statistical infrequency is being invoked 
when, for example, various occupational groups are spoken of as 
deviant "criterion" groups and comparisons are made between such 
deviant groups as lawyers, engineers, or physicians and "people in 
general" (normals?). Now it is only in the purely statistical and 
artificial sense of infrequency that any occupational group could be 
considered deviant. There is no single occupational group that con- 
tains anything like a majority, or even a substantial minority, of 
the workers who constitute “people in general.” Everyone is deviant 
in his occupation when contrasted with “people in general"—steam- 
fitters, plumbers, chemists, bookkeepers, taxi drivers, even psychol- 
ogists. Used in such a way, that is, to describe as deviant the mem- 
bers of any occupational group, or the members of any other class 
of objects solely on the basis of the fact that there are not many of 
them, the deviation hypothesis is divested of much of its meaning 
and becomes little more than an affirmation of individual differences. 
Deviation makes sense only when the number of possible classes is 
limited and when, we might add, the discrepancies in frequency are 
marked. As an instance of the latter point, we would be reluctant 
to call men in our society deviant because they are outnumbered by 
women. 


Extremeness of Characteristics as Deviations 


The term deviant is also applied to those persons who appear at 
the extreme of some dimension, e.g., intelligence or height, although 
the exact rules for application are not clear. There are two some- 
what different conceptions of the deviant which are consistent with 
a rather ancient argument in psychology. There is, first, a frequent 
use of the term deviant to describe a group of individuals who depart 
in a single, uniform direction from an undifferentiated mass of 
“normal” or nondeviant subjects. Thus, for example, typical prac-
tice is to identify persons as anxious and nonanxious, pathological
and nonpathological, feebleminded and normal, etc. Viewed in such
a manner, deviation may be looked upon as a unipolar trait with
some degree of rarity in the population, and the distribution of devi-
ation scores should be markedly skewed with a mode of 0, that is,
no deviation.

However, there is also the possibility that deviation could be con-
sidered a bipolar trait yielding a normal distribution with the mode
at a point separating those subjects who deviate in either direction.
We might, then, want to identify the anxious, the average and the
nonanxious persons with both the anxious and nonanxious being
deviant in the population. Or, we could consider both tall and short 
persons deviant, the nondeviants representing those persons of me- 
dium degree of height. 

Now presumably for either the Delta or the Deviation key the 
subject who obtains a high score, that is, who answers many items 
in the deviant direction is deviant. And he should be deviant in 
other ways. However, even a deviation scale has two ends, and sub-
jects may score low as well as high on such a scale. For example, in
their sample of nursing students Sechrest and Jackson (1962) found
that the mean and median of the distribution of Deviation scores
were almost identical and the scores ranged from just a little better
than two and one half standard deviations below the mean to about
the same range above the mean. What of those persons who scored
well below the mean on the Deviation key? They might appropri-
ately be described as being deviantly nondeviant. (Wiggins [1962b]
has suggested the term “hypercommunality” to refer to this type of
responding.) They adhered to the modal responses more consistently
than did other subjects. There were a number of lines of evidence
which pointed to the probable accuracy of the conclusion that low
scoring subjects are deviant and in certain of the same ways as high 
scoring subjects. To take but one example, both high and low scoring
subjects received more nominations on the sociometric type variables
“most pleasant” and “least pleasant.” High scorers were not con-
sistently seen as either more pleasant or less pleasant than low
scorers. Both were simply named more often than middle scorers.
This finding was consistent across a variety of different traits. Thus,
subjects scoring either deviantly high or deviantly low were more 
salient in their group. There were, however, some instances in which 
both the high and low scoring deviants are deviant on other meas- 
ures, but in opposite directions. For example, a verbal intelligence 
measure indicated that, as compared with a middle scoring group on 
the Deviation measure, the high scorers were lower in intelligence 
and the low scorers were higher in intelligence. Since we may just 
as well consider as deviant those who depart from the typical in one 
direction as another, deviant scorers on any particular measure do 
not in the sources of their deviation constitute a homogeneous group. 
We would conclude that it is appropriate to consider deviation as 
having more than a unidirectional interpretation. 
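
A bidirectional reading of a deviation scale can be sketched as follows. This is a minimal Python illustration, not the procedure used in the study cited: the cutoff of one standard deviation, the labels, and the seven scores are all assumptions chosen for the example.

```python
from statistics import mean, pstdev

def classify(scores, cutoff=1.0):
    """Label each score by its standardized distance from the group mean,
    treating both tails as deviant."""
    m, s = mean(scores), pstdev(scores)
    labels = []
    for x in scores:
        z = (x - m) / s
        if z >= cutoff:
            labels.append("deviant")
        elif z <= -cutoff:
            labels.append("deviantly nondeviant")
        else:
            labels.append("typical")
    return labels

# Hypothetical Deviation key scores for seven subjects.
deviation_scores = [10, 18, 19, 20, 21, 22, 30]
labels = classify(deviation_scores)
print(list(zip(deviation_scores, labels)))
```

Keeping the two tails under separate labels preserves the information about direction that a single "high versus not high" split would discard.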


Unique Structuring of Traits as Deviation


There is still another way in which persons may be deviant or in 
which deviant response tendencies may arise. Generally we are
bound in our thinking, concerning the relationships between traits
or characteristics of persons, to the conception of uniform relation-
ships in the samples with which we deal. Those persons who do not
conform to our models or our ideas are simply relegated to the 
anonymous ranks of error if they are considered at all. Yet, such 
persons may represent alternatives to our usual conceptions of the 
universe of “average people.” Consider the correlation scatterplot 
given in Figure 1. We usually assume that the relationship between 
two measures of intelligence is positive. And in this case the rela- 
tionship between ACE Q and L scores is positive. The correlation 
is .46. However, note that it is not positive for all of the subjects. 
For those represented by the small circles in the plot, the relation- 
ship could certainly not be considered to be positive and significant. 
In one sense at least, in the sense that they lower the correlation, 
they are deviant. One possibility is that for these subjects there is
some unique structuring of abilities that is not found among most
persons.

Figure 1. Correlation Scatterplot for ACE Q and L Scores for 60 Nursing
Students, r = .46.

From the standpoint of traditional test theory, it might be argued 
that by inferring the existence of unique structuring of traits, one 
ignores the possibility that such deviation is merely error of meas- 
urement. Such an argument appears compelling—save for one thing. 
Utilizing data previously reported (Sechrest & Jackson, 1961a), we
have plotted the correlations between three distinctly separate pairs
of variables² and have identified those persons whose locations on
the plots fall outside lines drawn parallel to and one standard error
from the mean regression line for the regression of the one variable
on the other (e.g., see Figure 1). In a sample of 60 nursing students
a total of 35 were correlational “outliers” on one or more of the cor-
relation plots. The remaining 25 were, then, consistent performers
who would produce very high correlations. Their other interesting
characteristic was that they scored significantly lower on the PRT-
Deviation Key than did those subjects who were keeping the cor-
relations low. The mean PRT-D score for the consistent group was
17.28 and for the “outlier” group 20.66 (t = 2.50; p < .02). This
finding has been supported, though at a lower level of significance
(.10 < p < .20), for three correlations involving different variables
in a group of 123 college students.

² The three pairs of variables and the correlations between them were: ACE
L score vs. ACE Q score, .46; MMPI Pd scale vs. Reputational unpredictabil-
ity, .37; Reputational social intelligence vs. Reputational pleasantness,
[value illegible]. Choice of variables upon which to calculate regressed
scores was accomplished by toss of a coin.
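
The identification of correlational “outliers” described above can be sketched in Python. The sketch assumes that “one standard error” refers to the standard error of estimate of the least-squares regression (computed here with n - 2 degrees of freedom) and that distance is measured vertically from the regression line; both are interpretations, and the function name and the six data points are hypothetical.

```python
import math

def regression_outliers(x, y, width=1.0):
    """Flag points lying more than `width` standard errors of estimate from
    the least-squares regression line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    # Standard error of estimate (assumed definition, n - 2 degrees of freedom).
    see = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    return [abs(e) > width * see for e in residuals]

# Hypothetical scores on two variables for six subjects.
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 12]
flags = regression_outliers(x, y)
print(flags)
```

Subjects flagged on one or more such plots would form the “outlier” group, and the remainder the consistent group, whose Deviation scores could then be compared.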

This problem of uniqueness in the structuring of traits or abilities 
should be afforded increased attention in psychometric methodology. 
While for many purposes it would be more realistic to base general- 
izations from factor analytic results upon consistent but unique 
organizations of traits in homogeneous subsets of subjects, most 
research workers concerned with the factor analysis of correlations 
between test scores assume that for each test the score of each person 
is the weighted sum of his factor scores, the weights being constant 
for each test, and the factor scores bearing the same interpretation 
for all people (Loevinger, 1948, p. 522). Other than the factor anal- 
ysis of correlations between persons prior to separate analyses of 
tests—a method used extremely infrequently, possibly because it is 
not widely understood—or, alternatively, the a priori selection of 
subsets of subjects thought to share certain unique clusterings of 
traits, there is not a factor analytic method available to take into 
account analytically possible lack of homogeneity of the structure 
of characteristics in different people. Because a factor analysis of
correlations between people cannot take into account the influence
of moderator variables or other curvilinear effects, and a priori clus-
tering is cumbersome, imprecise, and requires more information than
is usually available, a method for categorizing subjects homogeneous 
with regard to their constellation of traits is urgently needed. An 
approach to this problem recently outlined by Tucker (1962) prom-
ises its eventual solution. Application of a line of reasoning similar
to that described above to individual differences in multidimensional
scaling (Tucker & Messick, 1960) has been found to be quite useful
(Jackson & Messick, 1962b). 


Random Responding and Deviation 


We would like to offer one additional and tentative hypothesis 
about the nature of deviant responses. There are obviously any
number of determinants of a score which constitutes the total num-
ber of deviant or atypical responses that an individual makes on a
given task or during a particular period of time. One of these may
be sheer carelessness in responding. Parenthetically, let it be said
that we are not denying that carelessness may in and of itself be an
interesting response. Nevertheless, it may be pointed out that, if a
subject responds randomly or carelessly to a set of items, it is in-
evitable that he will make an unusual number of deviant responses,
just as a “pair of dice” will show markedly deviant personality
“scores” (Burnham & Crawford, 1935). This, of course, provides
the entire rationale for the F scale of the MMPI. Therefore, to the
extent that carelessness contributes to higher deviation scores, or, to
put it another way, to the extent that higher deviation scores are
correlated with careless responding, we would expect higher than
average deviation scores to be less reliable than lower scores. From
a population of 183 college students and nursing students (Sechrest
& Jackson, 1962), the 45 who scored highest on the Deviation key
[...]
offer as evidence a corrected reliability of -.10 for the high group
and .30 for the low group, it is clear that the two reliabilities differ
in a way which is consistent with the hypothesis that deviant re-
sponses may be careless responses. One additional bit of data bear-
ing on the carelessness hypothesis stems from the analysis of the
Deviation scores of subjects who are correlation “outliers,” i.e.,
whose performance is predicted poorly by regression scores. A
possible relationship between deviation as carelessness and as a
unique structuring of traits is evident. To the extent that perform-
ances are careless and unreliable, correlations between any two sets
of variables will be attenuated. Perhaps then, deviant response ten-
dencies may be, in part, a reflection of the rate of random responding
for a subject.
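
Why random responding inflates a deviation score can be shown with a small Python sketch. It assumes that every alternative is equally likely under careless responding and that exactly one alternative per item is modal; the 60-item length is hypothetical, while the four alternatives follow the description of the PRT given earlier.

```python
import random

def expected_deviant(n_items, n_alternatives):
    """Expected number of non-modal responses when every alternative is
    equally likely and one alternative per item is modal."""
    return n_items * (n_alternatives - 1) / n_alternatives

def simulate_random_subject(n_items, n_alternatives, rng):
    """Deviation score of one subject choosing uniformly at random
    (the modal alternative is coded 0)."""
    return sum(rng.randrange(n_alternatives) != 0 for _ in range(n_items))

rng = random.Random(0)   # fixed seed so the sketch is repeatable
expected = expected_deviant(n_items=60, n_alternatives=4)
simulated = [simulate_random_subject(60, 4, rng) for _ in range(1000)]

print(expected)
print(sum(simulated) / len(simulated))
```

Under these assumptions a purely random responder is expected to choose a non-modal alternative on three items in four, far more often than a subject who mostly follows the modal choices.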


Problems in the Measurement of Deviant Response Tendencies 

While many of the problems to be encountered in the measurement
of deviant response tendencies [...] which has preceded, we would
like [...] difficulties which, [...] particularly likely to become
sources of problems to workers interested in the deviation hypothesis.

These four difficulties may be summarized as follows: (a) a 
global index of deviation obscures information as to direction of 
deviation; (b) it may be of interest to consider scores which are 
atypically nondeviant; (c) deviant tendencies are almost certainly 
multidimensional, so that a global index of deviation is difficult to 
interpret; and (d) reliable indices of deviant response tendencies are 
difficult to construct, due to the greater frequency of errors in devi- 
ant scores, and to psychometric limitations imposed by extreme item 
splits. 

First, we believe that the probability that in many areas deviation 
may occur in opposite directions from some midpoint poses serious 
measurement problems. Identical numerical representations of a
degree of deviation from a midpoint may have quite different inter- 
pretations. It is not entirely satisfactory to be in the position of 
ignorance with respect to the direction of relationships between two 
deviation measures. For example, referring once more to the finding 
that subjects high in Deviation tendencies score lower on a standard 
test of verbal intelligence, we are left in a quandary concerning the 
implications of that finding. Why the deviants on the Perceptual 
Reaction Test should have deviated on an intelligence measure in 
the low rather than in the high direction (or in any direction at all) 
is not at all clear to us. A theoretical rationale which will subsume 
the relationships between deviation measures is badly needed. 

The second measurement problem is posed by the occurrence of a 
set of responses which are less deviant than they ought to be. That 
is, as we have previously reported, it is quite possible to obtain from 
a subject a series of responses to the Perceptual Reaction Test 
which conform to the modal response tendencies more than do most 
people's responses. These individuals are, as we have termed them, 
deviantly nondeviant. We do not believe that such subjects can 
simply be ignored, that they can casually be lumped together with 
the normally nondeviant—or to put it another way, the normally 
deviant—individuals. 

Thirdly, there is the problem of multidimensionality in deviation 
scores. Since the problem of dimensionality of global indices and 
distance scores has been treated in detail elsewhere (Lykken, 1956; 
Cronbach, 1958; Jackson, 1958) we need only point out that to the
extent that deviation is not a single dimension, that persons may be
deviant along several dimensions, a single index may obscure real
differential effects, for example, of two deviation dimensions corre-
lating in opposite directions with an outside criterion, and in a [...]
such dimensions are at very best difficult to interpret.
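
The way a single index can hide opposing effects is easy to demonstrate. In the following Python sketch the two deviation dimensions, the outside criterion, and the four subjects are entirely hypothetical; the dimensions are constructed to be uncorrelated with each other and to relate to the criterion in opposite directions.

```python
import math

def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Two hypothetical, mutually uncorrelated deviation dimensions.
dim_a = [-1, -1, 1, 1]
dim_b = [-1, 1, -1, 1]

# An outside criterion related positively to one and negatively to the other.
criterion = [a - b for a, b in zip(dim_a, dim_b)]

# A global index that simply adds the two dimensions.
global_index = [a + b for a, b in zip(dim_a, dim_b)]

r_a = pearson(dim_a, criterion)
r_b = pearson(dim_b, criterion)
r_global = pearson(global_index, criterion)
print(round(r_a, 3), round(r_b, 3), round(r_global, 3))
```

Each dimension taken separately is substantially related to the criterion, yet the summed index shows no relationship at all.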

The fourth measurement problem stems from general psychome-
tric theory about the characteristics of test items. As we have
shown, it is evidently not a routine task to construct a reliable
measure of deviant response tendencies, and perhaps particularly
so with the more deviant individuals. Unreliability or randomness
in responding contributes to increased deviation scores, and deviant
individuals may by their very nature be inconsistent and unreliable
in their responses. The problems to be faced by the measurement
specialist are considerable.

A number of writers (Brogden, 1946; Cronbach & Warrington,
1952; Gulliksen, 1945; Lord, 1952) have indicated that under ordi-
nary circumstances item difficulty levels or alternative preference
indices after correction for guessing should not depart markedly
from the fifty-fifty, or even, “right-wrong” split. And yet, concep-
tually, if we are keying items to identify those persons who respond
in a deviant manner on a particular test, there is strong reason to
prefer rather extreme splits. For example, to take two rather ex-
treme cases, suppose we have a large number of persons responding
to two two-choice items. Item I [...] reason for concern about the
inherent limitations on the reliability and “validity” of items which
have very extreme splits. One might, then, [...] be in order and that
items would [...] markedly uneven, but not extreme splits [...] tend-
encies to respond deviantly there is empirical reason to prefer items
which (a) have a fairly strongly preferred modal choice and (b)
deviant alternatives which attract only a small percentage of re-
spondents (Sechrest & Jackson, 1961b). There might be some psy-
chometric gain from utilizing only items and alternatives which
meet the above criteria.
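
The psychometric cost of extreme splits can be illustrated with the variance of a dichotomously scored item, which is p(1 - p) when a proportion p of respondents choose the keyed alternative. The Python sketch below uses this standard result; the particular splits listed are hypothetical examples, not values from any study cited here.

```python
def item_variance(p):
    """Variance of a dichotomously scored item when a proportion p of
    respondents give the keyed response."""
    return p * (1 - p)

# Hypothetical item splits, from even to very extreme.
splits = (0.50, 0.70, 0.90, 0.98)
variances = [item_variance(p) for p in splits]

for p, v in zip(splits, variances):
    print(p, round(v, 4))
```

The variance is greatest at an even split and shrinks rapidly as the split becomes extreme, which limits how much any one such item can contribute to the reliability of a total score even when its rare alternative is highly diagnostic.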

We believe that these rather paradoxical findings may be resolved 
if a deviation score is viewed as a point on a bipolar continuum. At 
one end lies typicality or nondeviation, being measured by the 
choice of modal responses, and at the other end lies an opposing 
tendency, deviation or atypicality, being measured rather separately 
by certain deviant responses. An individual's deviation score is, in 
actuality, the algebraic sum of two scores. The good items for the 
assessment of nondeviation are not necessarily the same as those 
good for assessing deviation. For example, a good item for measur- 
ing nondeviation would be one on which only about 1/3 of the re- 
spondents chose the most popular alternative. If, however, none of 
the remaining three alternatives was particularly unpopular, the 
item might not be very good as a measure of deviation. A good 
deviation item would be one for which there were rather distinctly 
unpopular choices with no overwhelmingly popular alternative. This 
argument requires further empirical appraisal but is related closely 
to the third general problem, i.e., the difficulties raised by the multi- 
dimensional nature of deviation scores, and the problems associated 
with the interpretation of a global index of deviation. 


Summary and Conclusions 


A critical analysis and review of issues raised by studies bearing 
on Berg’s deviation hypothesis was undertaken with the aim of 
explicating certain issues relevant to the hypothesis: viz., assump- 
tions implicit in the deviation hypothesis, the problem of generality, 
response styles and deviation, definitions of deviation, and problems 
of measurement and interpretation. 

Very early in our attempts to understand the deviation hypothesis, 
we concluded that without further clarification and definition of 
concepts such as “deviant response patterns” and “general” in the 
statement “Deviant response patterns are general... .” (Berg, 
1955), and of the corresponding measurement operations of the 
concepts, the hypothesis was altogether too broad and lacking in 
clarity to permit differential prediction. Further elaborations of the
hypothesis, some of which are not attributable to Berg, reveal com-
plexities, inconsistencies, and several problems worthy of systematic
study. 

Our conclusions are stated in the following ten propositions. 

1. Deviant response patterns are less than completely general in 
the sense that they are all associated in the general population. 

2. The expectation that criterion groups will differ in some re- 
spects from the general population on certain behavioral measures 
reflects a general affirmation of faith in the psychology of individual 
differences, and it should not be considered as an hypothesis con- 
firmable in the usual specific sense (Feigl, 1951; Bolles & Messick, 
1958). 

3. If it is admitted that deviant response tendencies are less than 
completely general, then it is appropriate to inquire into the prop- 
erties of items or measures differentiating deviant subgroups. The 
content of items or scales may be one important relevant property 
determining differential responses. 

4. Response patterns of criterion groups which deviate from nor- 
mals may appear similar due to imposed restriction of available 
alternatives on structured questionnaires, rather than necessarily 
being due to underlying identities in psychological processes. Such 
data are not necessarily interpretable as evidence supporting the 
generality of deviant response patterns. 

5. Where interest is focused upon unique psychopathological pro- 
cesses, contrasting responses of particular deviant groups will yield 
more useful information concerning differential processes than will 
contrasting deviant groups from normal subjects, particularly in 
cases in which response alternatives are limited and massive re-
sponse styles effects predominate.

6. Deviant response tendencies may often be better understood in 
terms of particular relatively uncorrelated response styles, such as
tendencies to acquiesce and respond desirably, operating in a given
assessment device, than in terms of “general deviation.”

7. There are a variety of separate measurement definitions of
deviant response tendencies having very different implications for
theory and assessment. Deviant response tendencies may be
considered in terms of: (a) absolute, or (b) relative deviation;
(c) statistical infrequency; (d) extremeness of traits; (e) unique
structuring of traits; and (f) randomness of responding. Deviation
measures based upon each of these definitions should receive sys-
tematic study.

8. Serious problems in the measurement of deviant response tend- 
encies arise out of the application of global indices of deviation. 
Measures should be constructed and studied yielding information 
regarding: (a) the direction of deviation; (b) the atypically non-
deviant performance; (c) multidimensionality due to consistent
individual differences in deviant responses to different classes and 
types of content, item formats, item wording and style, and other 
response determinants. 

9. Gain in precision of measurement of deviant response tenden- 
cies will result from an analysis of optimal item splits, taking into 
account and balancing the gain from using extreme “right-wrong” 
proportions in identifying deviant responses with the loss of infor- 
mation about the total group inherent in such extreme item splits. 

10. While Berg has performed a valuable service in emphasizing 
the importance of studying deviant response patterns, the study of 
such patterns should be increased in scope and complexity to take 
into account the many ways in which different people may be devi- 
ant, the role of different classes of content in eliciting deviant re-
sponses, and particular types of noncritical deviations unique to a
given psychopathological group, among other things. New analytical 
methods for treating data are required to do justice to the complex- 
ity of deviant response patterns. 


REFERENCES 


Adams, H. E. and Berg, I. A. “Schizophrenia and Deviant Response
Sets Produced by Auditory and Visual Test Content.” Journal of
Psychology, LI (1961), 393-398.

Barnes, E. H. “The Relationship of Biased Test Responses to Psy-
chopathology.” Journal of Abnormal and Social Psychology, LI
(1955), 286-290.

Barnes, E. H. “Factors, Response Bias, and the MMPI.” Journal of
Consulting Psychology, XX (1956), 419-421. (a)

Barnes, E. H. “Response Bias and the MMPI.” Journal of Consult-
ing Psychology, XX (1956), 371-374. (b)

Berg, I. A. “Response Bias and Personality: The Deviation Hy-
pothesis.” Journal of Psychology, XL (1955), 60-71.

Berg, I. A. “Deviant Responses and Deviant People: The Formula-
tion of the Deviation Hypothesis.” Journal of Counseling Psy-
chology, IV (1957), 154-161.

Berg, I. A. “The Unimportance of Test Item Content.” In B. M.
Bass and I. A. Berg (Editors), Objective Approaches to Per-
sonality Assessment. New York: D. Van Nostrand, 1959.

Berg, I. A. “Measuring Deviant Behavior by Means of Deviant Re-
sponse Sets.” In I. A. Berg and B. M. Bass (Editors), Conform-
ity and Deviation. New York: Harper and Brothers, 1961.

Berg, I. A. and Collier, J. S. “Personality and Group Differences in
Extreme Response Sets.” EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, XIII (1953), 164-169.

Berg, I. A., Hunt, W. A., and Barnes, E. H. The Perceptual Reaction
Test. Evanston, Ill., 1949.

Bolles, R. and Messick, S. “Statistical Utility in Experimental In-
ference.” Psychological Reports, IV (1958), 223-227.

Brogden, H. E. “Variation in Test Validity with Variation in the
Distribution of Item Difficulties, Number of Items, and Degree
of Their Correlations.” Psychometrika, XI (1946), 197-214.

Burnham, P. S. and Crawford, A. B. “The Vocational Interests and
Personality Test Scores of a Pair of Dice.” Journal of Educa-
tional Psychology, XXVI (1935), 508-512.

Cronbach, L. J. “Response Set and Test Validity.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, VI (1946), 475-494.

Cronbach, L. J. “Further Evidence on Response Sets and Test De-
signs.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, X
(1950), 3-31.

Cronbach, L. J. “Proposals Leading to Analytic Treatment of Social
Perception Scores.” In R. Tagiuri and L. Petrullo (Editors), Per-
son Perception and Interpersonal Behavior. Stanford: Stanford
University Press, 1958, pp. 352-379.

Cronbach, L. J. and Warrington, W. G. “Efficiency of Multiple-
Choice Tests as a Function of Spread of Item Difficulties.” Psy-
chometrika, XVII (1952), 127-147.

Edwards, A. L. The Social Desirability Variable in Personality As-
sessment and Research. New York: Dryden Press, 195[final digit illegible].

Feigl, H. “Confirmability and Confirmation.” Revue Internationale
de Philosophie, [volume, year, and pages illegible]. Reprinted in
P. P. Wiener (Editor), Readings in Philosophy of Science. New
York: Scribner's Sons, 1958, pp. 522 [remainder illegible].

Gulliksen, H. “The Relation of Item Difficulty and Interitem Cor-
relation to Test Variance and Reliability.” Psychometrika,
[volume, year, and pages illegible].

Hesterly, S. O. and Berg, I. A. “Deviant Responses as Indicators of
Immaturity and Schizophrenia.” Journal of Consulting Psychol-
ogy, XXII (1958), 389-395.

Jackson, D. N. “The Measurement of Perceived Personality Trait
Relationships.” Air Force Office of Scientific Research Techni-
cal Report, 1958. Reprinted in N. Washburne (Editor), Values,
Decisions, and Groups. New York: Pergamon Press, in press.

Jackson, D. N. and Messick, S. J. “Content and Style in Personality
Assessment.” Psychological Bulletin, LV (1958), 243-252.

Jackson, D. N. and Messick, S. “Acquiescence and Desirability as
Response Determinants on the MMPI.” EDUCATIONAL AND PSY-
CHOLOGICAL MEASUREMENT, XXI (1961), 771-790.

Jackson, D. N. and Messick, S. “Response Styles and the Assess- 
ment of Psychopathology." In S. Messick and J. Ross (Editors), 
Measurement in Personality and Cognition. New York: John 
Wiley & Sons, 1962. (a) 

Jackson, D. N. and Messick, S. “Individual Differences in Social 
Perception.” Journal of Clinical and Social Psychology, 1962, in 
press. (b) 

Jackson, D. N. and Messick, S. “Response Styles on the MMPI:
Comparison of Clinical and Normal Samples.” Journal of Ab-
normal and Social Psychology, LXV (1962), 285-299. (c)

Loevinger, Jane. “The Technic of Homogeneous Tests Compared
with Some Aspects of ‘Scale Analysis’ and Factor Analysis.”
Psychological Bulletin, XLV (1948), 507-529.

Lord, F. M. “The Relation of the Reliability of Multiple-Choice
Tests to the Distribution of Item Difficulties.” Psychometrika,
XVII (1952), 181-194.

Lykken, D. T. “A Method of Actuarial Pattern Analysis.” Psycho-
logical Bulletin, LIII (1956), 102-107.

Roitzsch, J. C. and Berg, I. A. “Deviant Responses as Indicators of
Immaturity and Neuroticism.” Journal of Clinical Psychology,
XV (1959), 417-419.

Sechrest, L. and Jackson, D. N. “Social Intelligence and Accuracy of
Interpersonal Predictions.” Journal of Personality, XXIX
(1961), 167-182. (a)

Sechrest, L. and Jackson, D. N. “The Generality of Deviant Re-
sponse Tendencies.” Pennsylvania State University Research
Bulletin, No. 18, 1961. (b)

Sechrest, L. and Jackson, D. N. “The Generality of Deviant Re-
sponse Tendencies.” Journal of Consulting Psychology, XXVI
(1962), 395-401.

Strong, E. K., Jr. Manual for Vocational Interest Blank for Men.
Stanford: Stanford University Press, 1935.

Tucker, L. R. “The Extension of Factor Analysis to Three Dimen-
sional Matrices.” Paper read at Thurstone Hall Dedication Con-
ference, Educational Testing Service, Princeton, New Jersey,

Tucker, L. R. and Messick, S. “Individual Differences in Multi-
dimensional Matrices.” Princeton, N. J.: Educational Testing
Service Research Memorandum, 1960.

Wiggins, J. S. “Definitions of Social Desirability and Acquiescence
in Personality Inventories.” In S. Messick and J. Ross (Editors),
Measurement in Personality and Cognition. New York: John
Wiley & Sons, 1962.

Wiggins, J. S. “Strategic, Method, and Stylistic Variance in the
MMPI.” Psychological Bulletin, LIX (1962), 224-242.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
VOL. XXIII, No. 1, 1963


RANK ORDER TYPAL ANALYSIS 


LOUIS L. McQUITTY 
Michigan State University 


Given a category of people, a type can be defined as a subcate-
gory of n people of such a nature that everyone in the subcategory
is more like each of the other n-1 persons than he is like any other
person in any other subcategory. Using this definition, a method of
typal analysis has been developed (McQuitty, 1961). It starts with
a matrix of interassociations between people. The next step is to ar-
range all of the indices of association in rank order. A more expedi-
tious method arranges the associations of every column, separately,
in rank order, and then builds submatrices that satisfy the above
definition of a type.

The method can be applied to indices of association between peo- 
ple, institutions, tests (or test items) and other subjects. 


The Method 


If a submatrix satisfies the above definition of a type, then it will
contain no rank larger than the number of persons in the type. Sup-
pose, for example, that we have a type composed of two persons i
and j, i being most like i and second most like j, and j in turn
being most like j and second most like i. The matrix of ranks would
be as shown in Table 1.


TABLE 1
Rank Orders in an Hypothetical
Type of Two Persons

      i   j
  i   1   2
  j   2   1




As another example suppose that i, j and k constitute a type in the
sense that i is most like i, second most like j and third most like k;
j is most like j, second most like i and third most like k; and k is
most like k, second most like j and third most like i. The matrix of
ranks would be as shown in Table 2.


TABLE 2
Rank Orders in an Hypothetical
Type of Three Persons

      i   j   k
  i   1   2   3
  j   2   1   2
  k   3   3   1

Each column gives the ranks for the person heading it.


It can now be shown that a submatrix (drawn from a larger
matrix of ranks) constitutes a type if it contains no rank larger than
the number of cases in the submatrix.

Let M represent a matrix of interassociations between m persons,
with entries in the diagonals to represent perfect associations. Let
the indices of association of each column be converted to ranks, to
yield matrix R. Let T represent any submatrix of t persons (t < m)
drawn from the matrix R of such a nature that no rank of T exceeds
t. Then we have a category of persons of such a nature that every
person in T is more like the other t-1 persons than he is like any
person left in R. Consequently, the persons of T constitute a type,
by our definition of a type.
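
The rank test just stated can be sketched in a few lines of Python. The function name `is_type` and the four-person rank matrix are hypothetical, chosen only for illustration; the matrix is ranked within columns, and persons 0, 1 and 2 follow the pattern described for Table 2.

```python
def is_type(ranks, members):
    """Return True if the submatrix of `ranks` restricted to `members`
    contains no rank larger than the number of members."""
    n = len(members)
    return all(ranks[r][c] <= n for r in members for c in members)


# Hypothetical rank matrix for four persons (ranks within columns).
# Column c lists how closely each row person resembles person c.
ranks = [
    [1, 2, 3, 4],
    [2, 1, 2, 3],
    [3, 3, 1, 2],
    [4, 4, 4, 1],
]

print(is_type(ranks, [0, 1]))     # persons 0 and 1
print(is_type(ranks, [0, 1, 2]))  # persons 0, 1 and 2
print(is_type(ranks, [1, 2]))     # persons 1 and 2
```

In this made-up matrix persons 1 and 2 fail as a pair because person 2 has rank 3 in the column of person 1, although both belong to the three-person type.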
The problem is to select from R all of the submatrices which fulfill
the definition of a type. Let i represent any person in R, and let N
containing n individuals represent any submatrix from R in which i
enters to form a type. Then i would have all the ranks from 1 to n
inclusive with the individuals of N (including i himself). Conse-
quently, to find the types into which i enters, we need only start
with i, withdraw the one person j who is most like i and test the
resultant submatrix for fulfillment of the definition of a type. If the
submatrix contains no rank larger than two, it represents a type;
otherwise it does not.

Select the person k next most like i and test the submatrix of i, j
and k for fulfillment of a type. If the submatrix contains no rank
larger than three, it represents a type; otherwise it does not. The
submatrix of i, j and k can qualify as a type even though the sub-
matrix of i and j fails to do so. An hypothetical result of this kind is
shown in Table 3; i and j alone do not qualify because they contain
a rank larger than 2, but i, j and k qualify because they contain no
rank larger than 3.


TABLE 3
An Hypothetical Type of Three Persons Containing
a Non-Type of Two Persons

The above procedures would continue until all m persons of the
original matrix M had been chosen in order of their similarity to i.
The successive tests would yield all of the types in which i is a mem-
ber. By applying the method to every individual, we would isolate
all types reflected by the data.
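
A minimal sketch of the successive tests for one starting person follows. The function name `types_containing` and the rank matrix are hypothetical. Persons are taken in order of their rank in column i, ties are broken by position (the order of tied persons is immaterial to the final result), and the submatrix containing every person is skipped because it cannot be tested against any remaining cases.

```python
def types_containing(ranks, i):
    """List every type that includes person i, given column-wise ranks."""
    m = len(ranks)
    # Persons ordered by similarity to i; i itself holds rank 1.
    order = sorted(range(m), key=lambda p: ranks[p][i])
    found = []
    for n in range(2, m):  # the full matrix is left out as untestable
        members = order[:n]
        if all(ranks[r][c] <= n for r in members for c in members):
            found.append(sorted(members))
    return found


# Hypothetical rank matrix for four persons (ranks within columns).
ranks = [
    [1, 2, 3, 4],
    [2, 1, 2, 3],
    [3, 3, 1, 2],
    [4, 4, 4, 1],
]

for person in range(4):
    print(person, types_containing(ranks, person))
```

Starting from person 2 in this made-up matrix, the pair stage fails and the three-person stage succeeds, which is the situation the text describes for a type that contains a non-type of two persons.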


An Illustration 


The method is illustrated by applying it to the data of Table 4,
which reports agreement scores between companies in terms of 32
variables assessing union-management relationships. Every com-
pany was evaluated as either above or below average on every vari-
able. Two companies agree on a variable if they are both either
above or below average on the variable, but not if one is above and
the other below average. The agreement score for two companies is
the number of variables on which they agree.
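
The agreement score can be sketched as a simple count over two dichotomous profiles. The function name and the two five-variable profiles below are hypothetical; 1 stands for above average and 0 for below average.

```python
def agreement_score(profile_a, profile_b):
    """Count the variables on which two above/below-average profiles agree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same variables")
    return sum(1 for a, b in zip(profile_a, profile_b) if a == b)


# Hypothetical profiles on five variables (1 = above, 0 = below average).
company_1 = [1, 0, 1, 1, 0]
company_2 = [1, 1, 1, 0, 0]

print(agreement_score(company_1, company_2))
print(agreement_score(company_1, company_1))
```

A company compared with itself agrees on every variable, which is why the diagonal entries of the agreement matrix equal the number of variables.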

Companies A and B are in the construction industry, C and D— 
trucking, E—grain processing, F—metal products and G and H— 
garments manufacturing. 


TABLE 4
Agreement Scores between Companies*

       A    B    C    D    E    F    G    H
  A   32   29   16   16   14    6   11    7
  B   29   32   17   17   13    6    8   10
  C   16   17   32   26   10    8    9   13
  D   16   17   26   32   10   12   11   11
  E   14   13   10   10   32   21   17   13
  F    6    6    8   12   21   32   19   17
  G   11    8    9   11   17   19   32   24
  H    7   10   13   11   13   17   24   32

* Data from McQuitty, 1954.




TABLE 5
Agreement Scores of Table 4 Converted to Ranks within Columns

       A    B    C    D    E    F    G    H
  A    1    2    4    4    4    7½   5½   8
  B    2    1    3    3    5½   7½   8    7
  C    3½   3½   1    2    7½   6    7    4½
  D    3½   3½   2    1    7½   5    5½   6
  E    5    5    6    8    1    2    4    4½
  F    8    8    8    5    2    1    3    3
  G    6    7    7    6½   3    3    1    2
  H    7    6    5    6½   5½   4    2    1


In Table 5, the agreement scores of Table 4 have been converted
to ranks within each column separately.
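
The conversion to ranks within columns can be sketched as follows. The function name `column_ranks` is hypothetical. The largest score in a column receives rank 1 and tied scores share the average of the ranks they occupy. The scores are those of Table 4, with the H column taken from the H row on the assumption that agreement scores are symmetric.

```python
SCORES = [  # rows and columns in the order A, B, C, D, E, F, G, H
    [32, 29, 16, 16, 14, 6, 11, 7],
    [29, 32, 17, 17, 13, 6, 8, 10],
    [16, 17, 32, 26, 10, 8, 9, 13],
    [16, 17, 26, 32, 10, 12, 11, 11],
    [14, 13, 10, 10, 32, 21, 17, 13],
    [6, 6, 8, 12, 21, 32, 19, 17],
    [11, 8, 9, 11, 17, 19, 32, 24],
    [7, 10, 13, 11, 13, 17, 24, 32],
]


def column_ranks(scores):
    """Rank each column separately, largest score first, averaging ties."""
    m = len(scores)
    ranks = [[0.0] * m for _ in range(m)]
    for c in range(m):
        column = [scores[r][c] for r in range(m)]
        for r in range(m):
            higher = sum(1 for v in column if v > column[r])
            tied = sum(1 for v in column if v == column[r])
            # Tied scores occupy ranks higher+1 .. higher+tied; use the mean.
            ranks[r][c] = higher + (tied + 1) / 2
    return ranks


ranks = column_ranks(SCORES)
print([row[0] for row in ranks])  # ranks in the column for Company A
```

The column for Company A gives C and D the shared rank 3.5, the tie discussed in the text.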

The first step is to select any company (A in this case) and the
one other company second most like it, B in this case (counting A
as most like A). Use A and B as the first two cases in building suc-
cessive submatrices for testing all of the types in which A occurs.
Then add companies C, D, E, G, H, and F in this order because they
are the third, fourth, fifth, sixth, seventh, and eighth most like A. The
successive submatrices are shown in Table 6.

When the submatrix included only two companies, A and B, the
largest rank was 2, equal to but not exceeding the number of cases
in the submatrix, and therefore the two companies constitute a type,
as shown by the t under “Classifications” in the first column of the
table.

The next step is to select the company third most like the original
company and add it, together with its appropriate ranks, to the sub-
matrix for Companies A and B.

Companies C and D are tied for the third and fourth ranks. Both
were therefore assigned a rank of 3½. In the case of ties, it is im-
material as to which of the tied companies is selected first. Company
C was selected in this case.


TABLE 6
Ranks of Table 5 Rearranged to Reveal all Types in which A is a Member

                          n (No. of
  Classifi-   No. of      cases in
  cation      ranks > n   submatrix)        A    B    C    D    E    G    H    F
                          1            A    1    2    4    4    4    5½   8    7½
  t           0           2            B    2    1    3    3    5½   8    7    7½
  c           3           3            C    3½   3½   1    2    7½   7    4½   6
  t           0           4            D    3½   3½   2    1    7½   5½   6    5
  c           5           5            E    5    5    6    8    1    4    4½   2
  c           5           6            G    6    7    7    6½   3    1    2    3
  c           1           7            H    7    6    5    6½   5½   2    1    4
  ?           0           8            F    8    8    8    5    2    3    3    1

Company C brought in three ranks greater than 3 (the number of
companies at this stage), and consequently the first three companies
A, B and C do not constitute a type. They are therefore classified c
for category in the column on the extreme left of the table.

The next company to come in was D, which is tied with C. In the
new submatrix composed of the four companies, A, B, C and D, the
highest rank is four, equal to but not exceeding the number of cases
in this submatrix. Therefore, this submatrix constitutes a type and
is so classified in the column on the extreme left of Table 6.

The method proceeds in the same fashion until every entry of 
Table 5 has been transferred to Table 6. 

When the last variable from the rank-order matrix (Table 5) has 
been transferred to the table of successive submatrices (Table 6), 
the last submatrix necessarily contains no rank higher than the 
number of cases, for this condition always holds for the matrix from 
which they were transferred, and the result does not justify class- 
ification of the submatrix as constituting a type; no cases are left 
in Table 5 which could have disturbed the typal requirement of the 
last submatrix. Therefore, the last submatrix (using all of the cases) 
must always be classified as indeterminate. This result is shown by 
the question mark under “classification” in Table 6. 

A comprehensive approach would be to start with every company 
and follow it through in the same fashion as we did with Company 
A. This is good for a comprehensive check on accuracy, but it is not 
essential in order to isolate all types if the analysis is known to be 
accurate throughout. Starting with B, for example, would yield the
same types, AB and ABCD, as did starting with A.

Starting with C, however, would yield the types CD and ABCD.

Even though Type CD is not as obvious as Type AB in Table 6,
it can be discovered quite readily by observing the intersections of
Rows C and D with Columns C and D. As a matter of fact it can
also be discovered in the same rows and columns of Table 5.

The function of the method is to ensure that every type contain- 
ing any individual, i, is made to stand out. 

Other types can also be discovered in both Tables 5 and 6. The
intersections of Rows and Columns EF reveal a type, EF, and, like-
wise, the intersections of Rows and Columns G and H reveal a type,
GH. However, the intersections of Rows and Columns EFGH do not
yield a type of four companies, nor do any three of them yield a
type. This latter fact is made more obvious by blanking out any
row and corresponding column, such as for G, and then observing
to see if EFH constitute a type; they do not because their intersec-
tions include ranks greater than three, the number of cases.


Results 


All of the types reflected in the data are AB, CD, ABCD, EF and
GH. This result is consistent with the nature of the data; Companies 
A and B are in the construction industry; C and D are in trucking; 
E and F are in grain processing and metal products, respectively; G 
and H are in garments manufacturing, and the latter two companies 
employ females while the other six employ males. 
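
As a check on the result above, the whole procedure can be sketched in Python and run on the agreement scores of Table 4. All function names are hypothetical, the H column is again filled in from the H row on the assumption of symmetry, and the submatrix containing all eight companies is left out as indeterminate.

```python
LABELS = "ABCDEFGH"
SCORES = [
    [32, 29, 16, 16, 14, 6, 11, 7],
    [29, 32, 17, 17, 13, 6, 8, 10],
    [16, 17, 32, 26, 10, 8, 9, 13],
    [16, 17, 26, 32, 10, 12, 11, 11],
    [14, 13, 10, 10, 32, 21, 17, 13],
    [6, 6, 8, 12, 21, 32, 19, 17],
    [11, 8, 9, 11, 17, 19, 32, 24],
    [7, 10, 13, 11, 13, 17, 24, 32],
]


def column_ranks(scores):
    """Rank each column separately, largest score first, averaging ties."""
    m = len(scores)
    ranks = [[0.0] * m for _ in range(m)]
    for c in range(m):
        column = [scores[r][c] for r in range(m)]
        for r in range(m):
            higher = sum(1 for v in column if v > column[r])
            tied = sum(1 for v in column if v == column[r])
            ranks[r][c] = higher + (tied + 1) / 2
    return ranks


def is_type(ranks, members):
    """True if no rank among the members exceeds the number of members."""
    n = len(members)
    return all(ranks[r][c] <= n for r in members for c in members)


def all_types(scores, labels):
    """Start from every case in turn and collect each type encountered."""
    ranks = column_ranks(scores)
    m = len(scores)
    found = set()
    for i in range(m):
        order = sorted(range(m), key=lambda p: ranks[p][i])
        for n in range(2, m):  # the complete matrix is indeterminate
            members = order[:n]
            if is_type(ranks, members):
                found.add("".join(sorted(labels[p] for p in members)))
    return sorted(found)


print(all_types(SCORES, LABELS))
```

Under these assumptions the sketch lists AB, ABCD, CD, EF and GH, the same five types reported above.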


Advantages of the Method 


One of the advantages of the method is that it can reject an hy-
pothesis of types. This ability is a direct result of the strict defini-
tion of types out of which the method was developed. The method
differs from most other pattern analytic approaches in this respect,
hierarchical syndrome analysis, for example (McQuitty, 1960). The
latter method forces data into types. This is the reason why the
latter method classified EFGH into a type, and typal analysis does
not.

After the types have been formed, a prototype may be identified
as either (a) the one person most like the members of his type, or
(b) a composite of the members of a type, and typal relevancies
may be computed as outlined elsewhere (McQuitty, 1957 and 1961a)
to reveal the extent to which each member resembles his prototype.

Another advantage of rank-order typal analysis is that it reports
the exceptions to forming a type. When ABC (Table 6) were ex-
amined as a type, the members were found to involve three excep-
tions; C has too large a rank with A and both A and B have too
large a rank with C.

When typal analysis fails to [...] definition of types, it is still
sometimes [...] individuals with ranks slightly larger than those
allowed by the stringent definition.

A typal structure can sometimes be improved by applying an item
analysis for the selection of configural items (McQuitty, 1961c).
This can be followed by another typal analysis, applied to the items 
which best differentiate the types (or type-like categories), and this 
reapplication can sometimes yield more exacting types than were 
obtained originally. 

Technical advantages of the method include its simplicity, its ap- 
plicability to all kinds of indices of association, between either 
people or tests, and its potentiality for being easily programmed to 
analyze large matrices on computers. 


Summary 


This paper develops and illustrates a simple paper-and-pencil 
method for the isolation of types, applicable to any index of associa- 
tion, and easy to program for the analysis of large matrices on 
computers. Furthermore, the method has the rather unique ability 
of either substantiating or rejecting the hypothesis of types, as well 
as the ability to isolate them at any level of exactness desired. 


REFERENCES 


McQuitty, L. L. “Pattern-Analysis: A Statistical Method for the
Isolation of Types.” In W. E. Chalmers, M. K. Chandler, L. L.
McQuitty, R. Stagner, D. E. Wray and M. Derber (Editors), Labor-
Management Relations in Illini City, Volume II. Champaign,
Illinois: Institute of Labor and Industrial Relations, University
of Illinois, 1954.

McQuitty, L. L. “Elementary Linkage Analysis for Isolating Or-
thogonal and Oblique Types and Typal Relevancies.” EDUCA-
TIONAL AND PSYCHOLOGICAL MEASUREMENT, XVII (1957), 207-
229.

McQuitty, L. L. “Hierarchical Syndrome Analysis.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, XX (1960), [pages illegible].

McQuitty, L. L. “Elementary Factor Analysis.” Psychological Re-
ports, IX (1961), 71-78. (a)

McQuitty, L. L. “Typal Analysis.” EDUCATIONAL AND PSYCHOLOGI-
CAL MEASUREMENT, XXI (1961), 677-696. (b)

McQuitty, L. L. “Item Selection for Configural Scoring.” EDUCA-
TIONAL AND PSYCHOLOGICAL MEASUREMENT, XXI (1961),
[pages illegible]. (c)


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
VOL. XXIII, No. 1, 1963


CUTTING SCORES AND ERRORS OF 
MEASUREMENT—A SECOND CASE¹


FREDERIC M. LORD 
Educational Testing Service 


As pointed out in an earlier article (Lord, 1962), the use of multiple 
cutting scores cannot be rigorously justified in the case where the 
selector variables are less than perfectly reliable. This conclusion 
was illustrated by means of a plausible mathematical model repre- 
senting approximately a certain type of situation in which multiple 
cutting scores are sometimes used. The model chosen had the follow- 
ing characteristic: if the demand for acceptable examinees increases, 
it is logically appropriate to adjust the multiple cutting scores down-
wards so as to select a larger group. The present article considers
a somewhat different mathematical model.

The usual justification given in support of the use of multiple
cutting scores is that no amount of trait w will compensate for a
deficiency in trait x, and vice versa. Logically this implies some
sort of discontinuity in the mathematical model. In effect, traits x
and w are being treated as dichotomous variables, since increments
in x or w are asserted to have no importance for determining the
value of the examinee as long as his relation to the cutting scores
is not affected. For example, if the cutting scores are set at x = 50
and w = 50, then a person with x = 90 and w = 49 is treated as
having no more value than a person with x = 90 and w = 19.

If, now, the demand for acceptable examinees increases, we can
hardly justify setting new cutting scores of, say, x = 30 and w = 30,
thus asserting that an examinee with x = 90 and w = 49 is now
really better than the examinee with x = 90 and w = 19. This line
of reasoning leads us to consider a new mathematical model in which,
given that the selector variables are perfectly reliable, it is logically
inappropriate to adjust the multiple cutting scores so as to select
a larger group. If demand increases, the total number of examinees
to be tested must be increased, or else the demand cannot be met
(unless, of course, it is admitted that the previous method for de-
termining the value of the examinee was incorrect and must now
be changed).

¹ This work was supported by contract Nonr-2752(00) between the Office of
Naval Research and Educational Testing Service. Reproduction in whole or in
part for any purpose of the United States Government is permitted.

Such a model will be discussed here. It will be found that the 
conclusions reached under this model appear to be very much the 
same as those reached under the original model. 


Mathematical Formulation 


Only the case of two selector variables will be considered explicitly,
although the generalization to more variables will be obvious. If
the selector variables are infallible (i.e., contain no errors of measure-
ment), they are denoted by ω and ξ. It is to be assumed here that
in the infallible case the use of multiple cutting scores is appropriate.
The cutting scores will be denoted by $\omega = \omega_0$ and $\xi = \xi_0$.

We now ask: Given that these multiple cutting scores provide an
optimum selection procedure in the infallible case, what is the opti-
mum selection procedure if all we have are w and x, these being
fallible measures of ω and ξ, respectively?

In the mathematical model to be used here, in accordance with
the picture drawn in the preceding paragraphs, the value, y, of any
examinee will be y = P, say, if he falls above the cutting score on
both ω and ξ; otherwise it will be y = F. Without loss of generality
the metric for y will be scaled so that P = 1 and F = 0. The metrics
used for w and x will be scaled so that $\bar{w} = \bar{x} = 0$ and $s_w = s_x = 1$.

In addition to the foregoing, the following assumptions will be
used.

1. The errors of measurement defined by w − ω and by x − ξ
each have an expected value of zero and are distributed independ-
ently of each other, of ω, of ξ, and of y.

2. After obtaining equation (2), it will be assumed for illustrative
purposes that the variables ω, ξ, w, and x have a multivariate nor-
mal distribution.


Derivation
The problem of finding an optimum selection procedure is equiva-
lent to the problem of finding a “region” of the space defined by
the selection variables such that the expected value of y at any point
inside this region is as large as or larger than the expected value of y
at any point outside the region. Thus any optimum selection region
is bounded by a contour line of the surface representing the regres-
sion of y on w and x. In accordance with standard notation, this
surface is denoted by E(y|w, x), which may be read as “the expec-
tation of y when w and x are given.”
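We start out with the information, provided by the assumptions
of the model, that

$$
\begin{aligned}
E(y \mid \omega > \omega_0,\ \xi > \xi_0) &= 1, \\
E(y \mid \omega < \omega_0) = E(y \mid \xi < \xi_0) &= 0.
\end{aligned}
\tag{1}
$$

It follows directly that

$$
E(y \mid w, x) = \operatorname{Prob}(\omega > \omega_0,\ \xi > \xi_0 \mid w, x), \tag{2}
$$

the right side being the probability that $\omega > \omega_0$ and $\xi > \xi_0$ simul-
taneously when w and x are given.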

In order to proceed further, some assumption must be made about
the joint distribution of some of the variables. Let us assume that
ω, ξ, w, and x have a normal multivariate distribution, which can
thus be completely specified by its means, variances, and inter-
correlations.

It follows from assumption 1 that $\bar{\omega} = \bar{w} = 0$ and $\bar{\xi} = \bar{x} = 0$. If
the reliability coefficients, $r_{ww}$ and $r_{xx}$, have been determined experi-
mentally, the correlation matrix for the four variables can now be
expressed in terms of known quantities by means of standard for-
mulas from mental test theory. With the supradiagonal elements
omitted and with the standard deviations shown in brackets in the
diagonal for convenience, the correlation matrix for ω, ξ, w, x, in
that order, is

$$
\begin{array}{llll}
[\sqrt{r_{ww}}] & & & \\
r_{wx}/\sqrt{r_{ww} r_{xx}} & [\sqrt{r_{xx}}] & & \\
\sqrt{r_{ww}} & r_{wx}/\sqrt{r_{xx}} & [1] & \\
r_{wx}/\sqrt{r_{ww}} & \sqrt{r_{xx}} & r_{wx} & [1]
\end{array}
\tag{3}
$$



The conditional distribution of ω and ξ when w and x are given
will be a normal bivariate distribution. The mean value of ω for this
conditional distribution, $M_{\omega \cdot wx}$, lies on the appropriate regression
surface:

$$
M_{\omega \cdot wx} = b_{\omega w \cdot x}\, w + b_{\omega x \cdot w}\, x, \tag{4}
$$

and similarly for $M_{\xi \cdot wx}$. The b's are the usual partial regression
coefficients, e.g.,

$$
b_{\omega w \cdot x} = \frac{r_{\omega w} - r_{\omega x}\, r_{wx}}{1 - r_{wx}^2} \cdot \frac{s_\omega}{s_w}, \tag{5}
$$

where $s_\omega$ is the standard deviation of ω.

The standard deviation of ω in the conditional distribution (with
w and x fixed) is the standard error of estimate

$$
s_{\omega \cdot wx} = s_\omega \sqrt{1 - R_{\omega \cdot wx}^2}, \tag{6}
$$

where $R_{\omega \cdot wx}$ is a multiple correlation coefficient. A similar formula
may be written for $s_{\xi \cdot wx}$. The correlation between ω and ξ in the
conditional distribution is the usual partial correlation denoted by
$r_{\omega\xi \cdot wx}$.
Thus the desired conditional distribution of ω and ξ can be speci-
fied completely from the quantities available in (3). The remaining
problem is to find out how much of the frequency in this conditional
distribution lies above the cutting scores $\omega = \omega_0$ and $\xi = \xi_0$, as re-
quired in equation (2). Since tabled values are available for the
standardized normal bivariate distribution, one goes to the pub-
lished tables (Pearson, 1924; U. S. Bureau of Standards, 1959) to
find, for specified values of w and x, the frequency cut off by the
relative deviates

$$
h = \frac{\omega_0 - M_{\omega \cdot wx}}{s_{\omega \cdot wx}}, \qquad
k = \frac{\xi_0 - M_{\xi \cdot wx}}{s_{\xi \cdot wx}}, \tag{7}
$$

when the correlation is $r_{\omega\xi \cdot wx}$.
Thus the value of E(y|w, x) can be calculated for any desired w
and x once $\omega_0$, $\xi_0$, $r_{wx}$, $r_{ww}$, and $r_{xx}$ are given. The contour lines of
E(y|w, x) are obtained by inverse interpolation in the published
tables: one finds various pairs of values of h and k that cut off some
specified frequency, E(y|w, x). Each (h, k)-pair defines a pair of values
of w and x. The contour line is the line containing all such points
for the specified value of E(y|w, x). Each contour line defines an
optimum selection region.

In the usual case where the selector variables are fallible, the 
contour line used to determine the selection region may appropri- 
ately be shifted one way or the other so as to obtain more or fewer 
selectees. As stated earlier, under the present model, such a shifting
of cutting scores could not be helpful if the selector variables were 
perfectly reliable. 
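
The calculation described in equations (2) through (7) can be sketched numerically in place of table look-up. The sketch below is an illustration under stated assumptions and is not the procedure used in the article: the function names are hypothetical, the regression weights are written in covariance form (equivalent to (5) when $s_w = s_x = 1$), and the bivariate normal frequency is obtained by Simpson's rule over a truncated range instead of the published tables.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def upper_orthant(h, k, rho, upper=8.0, steps=2000):
    """Approximate P(Z1 > h, Z2 > k) for standardized bivariate normal
    deviates with correlation rho, by Simpson's rule on [h, upper]."""
    if h >= upper:
        return 0.0
    scale = math.sqrt(1.0 - rho * rho)

    def integrand(z):
        # density of Z1 at z times P(Z2 > k | Z1 = z)
        return normal_pdf(z) * normal_cdf((rho * z - k) / scale)

    width = (upper - h) / steps
    total = integrand(h) + integrand(upper)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * integrand(h + j * width)
    return total * width / 3.0


def conditional_parameters(r_ww, r_xx, r_wx):
    """Regression weights, standard errors of estimate and partial
    correlation for true scores given w and x (s_w = s_x = 1)."""
    det = 1.0 - r_wx ** 2
    b_omega_w = (r_ww - r_wx ** 2) / det
    b_omega_x = r_wx * (1.0 - r_ww) / det
    b_xi_w = r_wx * (1.0 - r_xx) / det
    b_xi_x = (r_xx - r_wx ** 2) / det
    var_omega = r_ww - (b_omega_w * r_ww + b_omega_x * r_wx)
    var_xi = r_xx - (b_xi_w * r_wx + b_xi_x * r_xx)
    cov = r_wx - (b_omega_w * r_wx + b_omega_x * r_xx)
    return {
        "b_omega_w": b_omega_w, "b_omega_x": b_omega_x,
        "b_xi_w": b_xi_w, "b_xi_x": b_xi_x,
        "s_omega": math.sqrt(var_omega), "s_xi": math.sqrt(var_xi),
        "rho": cov / math.sqrt(var_omega * var_xi),
    }


def expected_value(w, x, omega0, xi0, r_ww, r_xx, r_wx):
    """E(y | w, x): probability that both true scores exceed their cuts."""
    p = conditional_parameters(r_ww, r_xx, r_wx)
    mean_omega = p["b_omega_w"] * w + p["b_omega_x"] * x   # equation (4)
    mean_xi = p["b_xi_w"] * w + p["b_xi_x"] * x
    h = (omega0 - mean_omega) / p["s_omega"]               # equation (7)
    k = (xi0 - mean_xi) / p["s_xi"]
    return upper_orthant(h, k, p["rho"])


print(conditional_parameters(0.90, 0.90, 0.60))
print(expected_value(0.0, 0.0, 0.0, 0.0, 0.90, 0.90, 0.60))
```

With reliabilities of .90 and an observed intercorrelation of .60, and both cutting scores at zero, the sketch gives a value of roughly .268 at w = x = 0. A contour line could then be traced by solving `expected_value(w, x, ...) = c` for x at a series of values of w, for example by bisection.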


Illustrative Example 


In order to illustrate the present model, an optimum selection
region was calculated for the case where $\omega_0 = \xi_0 = 0$, $r_{wx} = .60$,
and $r_{ww} = r_{xx} = .90$. In order to permit comparison, these values
were chosen to be the same as those used in the first illustrative ex-
ample for the earlier model (Lord, 1962). It was found that
$b_{\omega w \cdot x} = b_{\xi x \cdot w} = .84375$, $b_{\omega x \cdot w} = b_{\xi w \cdot x} = .09380$,
$R_{\omega \cdot wx} = R_{\xi \cdot wx} = .95200$, $s_{\omega \cdot wx} = s_{\xi \cdot wx} = .2904$, and
$r_{\omega\xi \cdot wx} = .1106$. When w = x = 0, the
required value in (2) was found from the bivariate tables, after using
(4) and (7), to be E(y|w, x) = .2675. The corresponding contour line
is obtained by finding in the table, in terms of h and k, other pairs
of values (h, k) which cut off .2675 of the frequency of the normal
bivariate distribution. These values (h, k) are then converted to
pairs of values of w and x by substituting (4) into (7) and solving
the two resulting simultaneous linear equations for w and x (this is
best done numerically rather than algebraically). The pairs of values
(w, x) thus found are then plotted in Cartesian coordinates and con-
nected by a smooth curve representing the contour line E(y|w, x) =
.2675, as shown in Figure 1. The selection region is the area above
and to the right of this curve.

A comparison of this curve with the corresponding boundary
shown in Figure 1 of the reference cited shows the two curves to be
graphically indistinguishable. Since two-way inverse interpolation
in the normal bivariate tables is somewhat hazardous, a more refined
comparison of the curves will not be attempted.

The main purpose of the present development is not to find rigor-
ously exact selection regions, but rather to see if the previously
stated conclusions still hold in a general way when the assumptions
and the mathematical model are substantially altered. The new re-
sults obtained support the earlier conclusion that “anyone now
using multiple cutting scores with relatively unreliable predictors
would probably do well to alter his selection procedures so as to
allow a high value on one predictor to compensate at least partially
for a low value on another.”


Figure 1
Optimum Selection Region for $r_{ww} = r_{xx} = .90$, $r_{wx} = .60$


REFERENCES

Lord, F. M. “Cutting Scores and Errors of Measurement.” Psycho-
metrika, XXVII (1962), [pages illegible].

Pearson, K. Tables for Statisticians and Biometricians, Volume
[number illegible]. London: Cambridge University Press, 1924.

U. S. Bureau of Standards. Tables of the Bivariate Normal Distribu-
tion Function and Related Functions. Applied Mathematics
Series 50. Washington: U. S. Government Printing Office, 1959.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
VOL. XXIII, No. 1, 1963


APPLICATION OF AN HIERARCHICAL GROUPING
PROCEDURE TO A PROBLEM OF GROUPING PROFILES¹


JOE H. WARD, JR. AND MARION E. HOOK


6570th Personnel Research Laboratory, Aerospace Medical Division
Air Force Systems Command, USAF


INVESTIGATORS often desire to group large numbers of persons, 
jobs, or objects into smaller numbers of mutually exclusive classes 
in which the members have similar characteristics. When the group- 
ing is done in a manner that establishes a taxonomy of mutually 
exclusive clusters wherein each larger unit is a unique combination 
of the next-subordinate units, the clusters are called “hierarchical 
groups.” Such groupings have proved particularly useful for classi- 
fication purposes. For example, plants and animals may be hier- 
archically grouped with respect to genetic characteristics; library 
holdings grouped in terms of their contents to facilitate storage and 
retrieval of information; persons (or jobs) grouped in terms of 
specified characteristics for purposes of personnel administration.
While grouping ordinarily results in some loss of information, it may 
generate new information as well as increase the efficiency with 
which large masses of data can be considered. Therefore the tech- 
nique is applied in many situations. 

Until recently, grouping has rarely been done by mathematical
techniques. The procedures now available require arbitrary or
subjective decisions; optimally homogeneous groups are not formed,
and the loss resulting from the grouping is not quantified. Much
attention has been given to two technical problems, namely, the
measurement of the relative similarity of profiles and the formation
of groups of profiles. Many different ways of handling the first
problem have been discussed in the past decade (Rao, 1952; Cronbach &
Gleser, 1953; Helmstadter, 1957; Thorndike, Hagen, et al., 1957;
Sawrey, Keller & Conger, 1960). Two approaches to the second
problem are currently used (Sawrey, Keller & Conger, 1960). To the
best of our knowledge, however, no one has applied an hierarchical
grouping procedure to a problem of classifying profiles. As we shall
show, this approach is advantageous when computer facilities are
available.

¹ The methodological research on which this report is based was
sponsored by Personnel Laboratory, Aerospace Medical Division, Air
Force Systems Command, under Project 7734, Task 773403.

Ward (1961) has mathematically described a procedure for forming
hierarchical groups of mutually exclusive sets in a manner that
satisfies any stated criterion of the investigator.² The purpose of the
present report is to describe an application of Ward's technique to
a problem of grouping profiles. The data for 25 profiles to which the
general computer program has been applied are those given by
Sawrey, Keller, and Conger (1960, p. 661). These authors compare
the results of their technique of grouping profiles with the results
of a Q-technique factor analysis. Hence the interested reader can
examine the results of three different treatments of the same profile
data.


Method 
Hierarchical Grouping Procedure 


Objective function. The hierarchical grouping procedure is based
on the premise that the most accurate information is available
when each individual constitutes a group. Consequently, as the
number of groups is systematically reduced, k, k - 1, ..., 1, the
clustering of increasingly dissimilar individuals will yield less
precise information. The extent of the inaccuracy associated with
grouping can be quantified by a value-reflecting number derived from
an objective function. This objective function may be any functional
relation that reflects the investigator's criterion.³

At each stage in the profile-grouping problem described here, the
goal is to form a group such that the sum of the squared within-group
deviations about the group mean of each profile variable is
minimized for all profile variables in all groups at the same time.
A satisfactory objective function for this purpose can be formulated
and described mathematically. Since the criterion and its associated
objective function have been arbitrarily selected for use in this study,
it is appropriate to denote the function in vector terms that may
suggest convenient ways of expressing other criteria. Given n
individuals each of whom has been observed on p characteristics, we
define the following vectors:

y = a vector, of dimension np, which has as elements the
    observed values of the p characteristics

a set of np predictor vectors of the form

x^(r,s) = a vector, of dimension np, in which the elements are 1's
    if the corresponding elements of y came from the rth
    person on the sth characteristic; and 0's otherwise
    [r = 1, ..., n; s = 1, ..., p]

We can now express y as a linear combination,

\[ y = a_{11}x^{(1,1)} + a_{12}x^{(1,2)} + \cdots + a_{np}x^{(n,p)} + e \]

where

e = a residual vector = the null vector, since y can be expressed
    without error as a linear combination of the vectors x^(r,s)
    [r = 1, ..., n; s = 1, ..., p].

The purpose of the grouping operation, then, is to unite groups
in a manner that reduces the number of predictor vectors [x^(r,s)] by
p at each iteration while keeping the elements of the error vector
near zero. The objective function can be given by

\[ Z[i, j, k-1] = \sum_{m=1}^{np} e_m^2 \]

where

e = the error vector, with elements e_m, that results from the
    reduction of k groups to k - 1 groups by uniting group i with
    group j.

Accordingly, if we define

Y_rsg = the observation of the sth characteristic for the rth person
    in the gth group
    [r = 1, ..., n_g; s = 1, ..., p; g = 1, ..., k - 1],

the value of the objective function, expressed as a sum of squared
deviations (SSE), is given by

\[ \mathrm{SSE} = \sum_{s=1}^{p} \sum_{g=1}^{k-1} \sum_{r=1}^{n_g} \left( Y_{rsg} - \frac{1}{n_g} \sum_{r'=1}^{n_g} Y_{r'sg} \right)^{2} \]

² The general computer program for the hierarchical grouping of
variables has been used at Personnel Laboratory in recent studies of
[...] different Air Force problems. The results of these investigations
are now being prepared for publication. The report of an unusual,
operationally useful application to a problem of clustering criteria
was recently published (Bottenberg & Christal, 1961).

³ The choice of a functional relation to be used as an objective
function obviously depends upon the nature of the problem and the
criterion selected by the investigator. Take as examples three
objective functions used in studies at Personnel Laboratory: (a)
crosstraining time: Air Force jobs grouped in mutually exclusive
categories so as to minimize the time needed to [...] men from their
present jobs to other jobs in the same [...]; (b) time-demands: tasks
assigned to job incumbents in the same Air Force [...] ladder grouped
in mutually exclusive classes so as to maximize the similarity of the
time expended on tasks in the same cluster; and (c) predictive
efficiency: regression equations used to predict success in Air Force
technical schools grouped in mutually exclusive categories so as to
minimize loss in predictive efficiency (Bottenberg & Christal, 1961).
Hierarchical grouping. Since Ward's computer program yields a
complete hierarchical structuring, it is unnecessary to specify in
advance the number of groups to be formed or to select the nuclei of
potential groups.⁴ Given a total of k sets, or groups, this program
insures their reduction to k - 1 sets with the least possible
impairment of the optimal level of the objective function at each
stage in the grouping operation. The SSE reflecting the optimal level
of the objective function is 0.00. This occurs in this application when
there are 25 groups each containing one profile. The reduction to 24
groups (1 group having two profiles; 23 groups each having one
profile) is made by considering all 25(24)/2 = 300 possible pairings
of the 25 profiles and selecting that pairing for which the
objective-function value is the smallest. Thus the first iteration
results in pairing the two most homogeneous profiles in the array.
Moreover, the cost of this grouping is available in terms of the SSE
associated with 24 groups.

⁴ The mathematical description of the procedure given by Ward (1961)
includes formulae for determining both the [illegible] forming
groups and the number of distinguishable [illegible].

Repetition of this process permits systematic reduction of the
number of groups, 24, 23, ..., 1. At the outset of the second and
subsequent iterations, each group previously formed is treated as
one unit, regardless of the number of profiles in the group. Hence
the grouping procedure routinely involves consideration of the effects
of: (a) pairing each two of the remaining one-profile groups, (b)
pairing each remaining one-profile group with each previously
formed profile cluster, and (c) pairing each two previously formed
profile clusters. The pairing selected is always that which reduces by
one the number of groups while minimizing the objective-function
value.
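
The iteration just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the IBM 650 program used in the study: the function names and the four two-element profiles are hypothetical, and the SSE is recomputed from raw scores for every trial pairing for clarity rather than speed.

```python
from itertools import combinations


def group_sse(rows):
    """Sum of squared deviations about the group mean, over all variables."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[s] for row in rows) / n for s in range(p)]
    return sum((row[s] - means[s]) ** 2 for row in rows for s in range(p))


def total_sse(profiles, groups):
    """Objective function: within-group SSE summed over all groups."""
    return sum(group_sse([profiles[i] for i in g]) for g in groups)


def hierarchical_grouping(profiles):
    """Reduce k groups to k - 1 repeatedly, each time choosing the pairing
    that yields the smallest SSE. Returns one (k, groups, SSE) per stage."""
    groups = [[i] for i in range(len(profiles))]
    stages = [(len(groups), groups, 0.0)]
    while len(groups) > 1:
        best_sse, best_groups = None, None
        # Try every pairing of current groups (singletons and clusters alike).
        for a, b in combinations(range(len(groups)), 2):
            trial = [g for idx, g in enumerate(groups) if idx not in (a, b)]
            trial.append(groups[a] + groups[b])
            sse = total_sse(profiles, trial)
            if best_sse is None or sse < best_sse:
                best_sse, best_groups = sse, trial
        groups = best_groups
        stages.append((len(groups), groups, best_sse))
    return stages


# Hypothetical data: four profiles, two profile elements each.
profiles = [(0, 0), (0, 2), (10, 10), (10, 11)]
stages = hierarchical_grouping(profiles)
for k, groups, sse in stages:
    print(k, groups, sse)
```

With these made-up profiles the two closest profiles (indices 2 and 3) are united first, and the SSE jumps sharply only at the final merge, which is the kind of pattern the SSE row of a complete hierarchical solution is meant to expose.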

When the complete hierarchical solution has been obtained, the 
SSE values may be compared to ascertain the relative homogeneity 
of the groups formed at different stages in the process. A sharp in- 
crease in the objective-function value indicates that much of the 
classification system’s accuracy has been lost by reducing the num- 
ber of groups by one at this stage. Such information on the relative 
“costs” of different numbers of groups provides valuable guidance 
whenever it is necessary to decide upon the specific number of
profile categories to be used for classification purposes.


Profile Data 


For convenience, the matrix of d²'s used to illustrate the
hierarchical grouping procedure is reproduced here from Sawrey,
Keller, and Conger (1960, p. 661). This matrix summarizes the distance
functions obtained for each pairing of 25 Conger-Wilson Designs Test
profiles, which were obtained from Air Force enlisted men. The d²
for each pair has been computed by squaring the difference in scores
on each profile element and summing these squares. The six profile
elements are “designed to show the subject's relative preferences for
a variety of formal art elements (e.g., warm vs. cool color, strong vs.
weak contrast, symmetry vs. asymmetry, etc.)” (Sawrey, Keller &
Conger, 1960, p. 662). Scores on each element have a possible range
from 0 to 10.

The objective function used in the hierarchical grouping technique
was computed from the d² matrix. This is readily done, for it can be
demonstrated that if we identify any two persons in the gth group
as r and r′,

\[ \sum_{s=1}^{p} \sum_{g=1}^{k-1} \sum_{r=1}^{n_g} \left( Y_{rsg} - \frac{1}{n_g} \sum_{r'=1}^{n_g} Y_{r'sg} \right)^{2} = \sum_{g=1}^{k-1} \frac{1}{2 n_g} \sum_{r=1}^{n_g} \sum_{r'=1}^{n_g} \sum_{s=1}^{p} \left( Y_{rsg} - Y_{r'sg} \right)^{2} \]

where (as before)

r = person [r = 1, ..., n_g]
g = group [g = 1, ..., k - 1]
s = characteristic [s = 1, ..., p]
Y_rsg = observation of the sth characteristic for the rth person in
    the gth group.
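
The relation between the within-group sum of squares and the pairwise d² values can be checked numerically. The sketch below is illustrative only: the function names and the scores are hypothetical, and it assumes the d² for a pair is the sum of squared score differences over the profile elements, as defined in the text.

```python
def sse_from_scores(group):
    """Within-group sum of squared deviations about the variable means."""
    n = len(group)
    p = len(group[0])
    means = [sum(person[s] for person in group) / n for s in range(p)]
    return sum((person[s] - means[s]) ** 2 for person in group for s in range(p))


def d_squared(profile_a, profile_b):
    """Squared distance: sum of squared score differences over elements."""
    return sum((a - b) ** 2 for a, b in zip(profile_a, profile_b))


def sse_from_d2(group):
    """Same quantity computed only from pairwise d² values.
    Every unordered pair is counted twice, hence the divisor 2n."""
    n = len(group)
    total = sum(d_squared(group[r], group[r2])
                for r in range(n) for r2 in range(n))
    return total / (2 * n)


# Hypothetical three-person group scored on three profile elements.
group = [(2, 7, 1), (4, 3, 5), (9, 5, 0)]
print(sse_from_scores(group), sse_from_d2(group))

# A two-person group whose d² is 2 has an SSE of 2 / 2 = 1.0.
pair = [(3, 4), (4, 5)]
print(d_squared(*pair), sse_from_d2(pair))
```

Because only the d² values are needed, the grouping can proceed from a distance matrix alone, without access to the original scores.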
Results 


The complete hierarchical structuring of the 25 profiles that
minimizes the objective function at each stage is given in Table 2.
This grouping was completed within approximately 15 minutes on an
IBM 650 Computer. With the exception of the last three rows, Table
2 is a copy of the computer's printout of the final results of the
hierarchical grouping program. While repetition of the
profile-identity numbers in the body of the table may seem
unnecessary, it is a convenience in reading results of large-scale
studies (n = 100 to 1,000) and permits programming the computer to
start the printout at any specified number of groups. In large
studies, the final stages of grouping often are of greater interest
than the early stages; and so the results of the grouping operation
are given in “reverse” order. When Table 2 is read from right to left,
a comparison of Columns 24 and 25 shows the boundary line between
Profiles 1 and 14 has disappeared. Column 24, then, indicates Profiles
1 and 14 are the first to be grouped, hence the most homogeneous.
Comparison of Columns 23 and 24 shows that the next boundary line to
disappear is between Profiles 7 and 24, indicating that these are the
next to be grouped with least impairment of the objective function.

The last three rows of Table 2 reflect the cost of each grouping.
These values are readily calculated from information in a computer
printout (not shown here). Values of the objective function (SSE row
of Table 2) indicate the over-all accuracy of the classification system
at each level in the hierarchy. For example, when all 25 profiles are
in one group, the SSE value is 605.9; at the seven-group stage, the
SSE value of 104.0 reflects considerable within-group homogeneity;


TABLE 1
Matrix of d²'s Between Persons (from Sawrey, Keller, and Conger, 1960, p. 661)
[Entries of the 25 × 25 matrix, indexed by profile identifying number, are not recoverable from this copy.]



TABLE 2
Structure of 25 Profiles Resulting from Hierarchical Grouping Procedure
[Table body not recoverable from this copy. Columns: Number of Groups, 1 to 25. Rows: profile identifying numbers with group boundary lines, followed by three rows giving the SSE and its first and second differences.]



but the SSE value of 65.0 for eight groups indicates that these groups
have considerably greater homogeneity than the seven groups. The
rows identified as Δ1 and Δ2 facilitate comparisons. The Δ1 entry
(39.0) between Columns 7 and 8, for example, shows the change in
the error that results from merging the cluster containing Profiles
2 and 16 with that containing Profiles 13, 22, and 21. The Δ2 entries,
e.g., 27.2 below the Δ1 entry of 39.0, show the differences between
the Δ1 entries and reflect the acceleration of error.
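
The arithmetic behind the two difference rows can be reproduced from SSE values quoted in the text (53.2 for nine groups, 65.0 for eight, 104.0 for seven). The short sketch below is illustrative; the variable names are assumptions, and results are rounded to one decimal place to match the reported precision.

```python
# SSE values quoted in the text, keyed by number of groups.
sse = {9: 53.2, 8: 65.0, 7: 104.0}

# First differences: increase in SSE caused by each reduction of one group.
delta1_9_to_8 = round(sse[8] - sse[9], 1)
delta1_8_to_7 = round(sse[7] - sse[8], 1)

# Second difference: change between successive first differences.
delta2 = round(delta1_8_to_7 - delta1_9_to_8, 1)

print(delta1_9_to_8, delta1_8_to_7, delta2)
```

The first difference for the step from eight to seven groups comes out at 39.0 and the second difference at 27.2, in agreement with the entries cited above.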


Discussion 


In this application of the general procedure for hierarchical group- 
ing (Ward, 1961), the objective at each stage has been to form a 
grouping that minimizes the sum of the squared within-group devia- 
tions about the group mean of each profile variable for all profile 
variables for all groups at the same time. In discussing their group- 
ing procedure, Sawrey, Keller, and Conger (1960) speak of forming 
profile groups “in which the group members are similar to each other 
and, at the same time, dissimilar from the members of all other 
groups” (p. 660). Since their method involves selecting nucleus 
groups, it appears that, for a specific number of groups, they sought 
a solution that would maximize the between-group sum of squares 
and minimize the within-group sum of squares for all profile ele- 
ments. The hierarchical grouping procedure described here minimizes 
the within-group and maximizes the between-group sums of squares 
at each level in the hierarchical structure. However, this does not 
necessarily result in the minimal within-group and maximal be- 
tween-group sums of squares for a specific number of groups. 

It is interesting to examine the groups that result from the two
different techniques of grouping. Let us first compare the initial
seven groups produced by the hierarchical technique (Column 17,
Table 2) with the seven original “dissimilar nucleus groups”
established by the Sawrey, Keller, Conger (SKC) method. The SKC
groups are: I, Profiles 1, 14; II, Profiles 16, 2; III, Profiles 3, 12;
IV, Profiles 9, 23; V, Profiles 13, 22; VI, Profiles 6, 4, 5; VII,
Profiles 24, 7, 19. Five of the seven groups in each array are the same
at this stage. In addition, two of the three profiles in SKC Group
VI have been grouped in the hierarchical process; the third profile
is added at the 15-group stage. The major difference in the two
arrays is that the hierarchical technique groups Profiles 17 and 20
before it forms SKC Group V (which appears after two more
iterations). Otherwise, the two grouping procedures yield practically
the same results at this stage.

With the Sawrey, Keller, and Conger method of selecting arbitrary
d² limits as criteria, the first additions to the original disparate
groups are Profile 21 (Group V) and Profile 18 (Group VII). The
same profiles are added to the same groups in the hierarchical
grouping operation (Columns 12 and 11, Table 2). When d² < 18.17, the
Sawrey, Keller, Conger technique clusters Profiles 10 and 20 with
Group V. At the same time, Profiles 17 and 25 are left ungrouped
because each now can be placed in either one of two groups. This
treatment of Profiles 17 and 25 obviously affects the centroids of the
groups to which these profiles might have been added and the
subsequent classification of the three remaining profiles. In contrast,
the hierarchical grouping technique soon clusters Profile 17 with 20
and Profile 11 with 25. This results in two “isolates” (Profiles 8 and
15) at the 11-group stage, which is comparable to the final array
with five “intermediates” (Profiles 8, 11, 15, 17, and 25) produced
by the Sawrey, Keller, and Conger technique.

In comparing the two grouping techniques, one must not overlook
the fact that the hierarchical grouping procedure takes account
of all profiles at all stages. The seven maximally homogeneous,
hierarchical clusters are those shown at the 18-group stage. Here, SKC
groups I, II, III, and IV are duplicated; SKC groups VI and VII are
each defined by two of their three members. By the time hierarchical
clusters having the same profile membership as the original seven
SKC groups have been formed (Column 15), an eighth cluster
(Profiles 17 and 20) is present. At this stage, the cost reflected by
the SSE value is almost 75 per cent greater than it was when the seven
maximally homogeneous clusters were defined.

Much additional information is available in Table 2. At the
18-group stage, where Profiles 13 and 22 (SKC Group V) as well as
Profiles 10, 17, and 20 are each a one-profile group, the SSE value
is 10.5. The pairing of Profiles 17 and 20 at this stage results in a
small SSE increment (2.0). Note, however, the sharp rise in the
error term when the number of groups is reduced from 12 groups
(SSE = 29.5) to 9 groups (SSE = 53.2). Detailed information on
specific profiles is easily extracted. Uniting Profiles 17 and 20, for
example, produces a smaller increment in error than grouping
Profile 4 with 5 and 6 (SKC Group VI) or Profile 13 with 22 (SKC
Group V); Profile 10 is more homogeneous with the 17-20 cluster
than it (or Profile 21) is with the 13-22 cluster. Furthermore, with
complete hierarchical structuring, Profiles 11 and 25 are paired
before Profiles 21 and 18 are added to their respective clusters. As
illustrated here, valuable insights can be derived from the complete
hierarchical structure of the profiles that results from systematically
reducing the number of groups one by one, from k to 1, in a manner
that least impairs the objective function.

This grouping technique has several desirable features. Once the 
objective function has been established, the computer program for 
hierarchical grouping requires few arbitrary decisions. It eliminates 
the need to define nucleus groups or to set limits to specify the 
order in which profiles will be added to nucleus groups. This group- 
ing technique is systematic and replicable. The computations do 
not require an undue amount of machine time. Given a matrix con- 
taining measures of the similarity of profiles, the complete hierarchi- 
cal grouping of 100 profiles can be accomplished in about one hour 
if the investigator has access to equipment such as an IBM 650 
Computer. Since the matrix can contain any measure of profile 
similarity and since the objective function can reflect any criterion 
specified by the investigator, the general computer program is ap- 
plicable to many profile-grouping problems. 

The evaluation of the effects of grouping on an objective function 
is a valuable feature of the hierarchical grouping technique. When 
the investigator can examine both the content of clusters and the 
error associated with grouping for each stage in the hierarchical 
structure, he has complete information for all levels of profile 
homogeneity represented in the sample. This information provides 
operationally useful guidance for handling many problems, e.g., how 
many groups should be used for specified classification purposes, 
which groups should be compared to evaluate the effects of different 
treatments, and so forth. In some situations, use of relatively large 
rather than small numbers of groups may not improve the accuracy 
of the classification system enough to justify associated increases in 
administrative costs or time delays. Under other circumstances, a
sudden rise in the costs of grouping may indicate that it is not
appropriate to reduce the number of groups beyond a certain level.
These are but a few of many possible examples. The specific use
made of the information on grouping costs is dictated by the problem
under study.


Summary 


An application of an hierarchical grouping procedure (Ward,
1961) to a problem of grouping profiles is described. The matrix for
25 profiles (based on an art preference test) to which the computer
program was applied was taken from Sawrey, Keller, and Conger
(1960), who applied a different grouping technique and performed a
Q-technique factor analysis of the data. The profile clusters obtained
by their grouping technique are compared with those obtained by
the hierarchical grouping technique.

The hierarchical grouping technique has a number of desirable 
characteristics. Any measure of profile similarity may be used in the 
matrix describing the profiles. There is no need to specify in advance 
the number of groups to be formed, to select nucleus groups, or to 
set arbitrary limits for use in adding profiles to groups. Grouping 
can be based on any criterion expressed and evaluated in terms of 
an appropriate functional relation, or objective function, e.g., SSE. 
The resulting hierarchical structure of the k profiles is that which, 
at each stage (k, k — 1,..., 1), least impairs the objective function. 
Hence the hierarchical grouping technique shows not only the order 
in which profiles are grouped so as to yield an optimal value of the 
objective function when the number of profiles is systematically re- 
duced from k to 1, but also the costs of each grouping. 


REFERENCES 


Bottenberg, R. A. and Christal, R. E. An Iterative Technique for
Clustering Criteria which Retains Optimum Predictive Efficiency.
Lackland Air Force Base, Texas: Personnel Laboratory, Wright
Air Development Division, Air Research and Development
Command, USAF, March, 1961. (Technical Note WADD-TN-61-30)

Cronbach, L. J. and Gleser, Goldine C. “Assessing Similarity
Between Profiles.” Psychological Bulletin, L (1953), 456-473.

Helmstadter, G. C. “An Empirical Comparison of Methods for
Estimating Profile Similarity.” EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, XVII (1957), 71-82.

Rao, C. R. Advanced Statistical Methods in Biometric Research.
New York: John Wiley & Sons, 1952.

Sawrey, W. L., Keller, L. and Conger, J. J. “An Objective Method of
Grouping Profiles by Distance Functions and Its Relation to
Factor Analysis.” EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, XX (1960), 651-674.

Thorndike, R. L., Hagen, Elizabeth P., Orr, D. B. and Rosner, B.
An Empirical Approach to the Determination of Air Force Job
Families. Lackland Air Force Base, Texas: Air Force Personnel
and Training Research Center, August 1957. (Technical Report
AFPTRC-TR-57-5)

Ward, J. H., Jr. Hierarchical Grouping To Maximize Payoff.
Lackland Air Force Base, Texas: Personnel Laboratory, Wright Air
Development Division, Air Research and Development
Command, USAF, March 1961. (Technical Note WADD-TN-61-29)



EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


AN EXPERIMENTAL DETERMINATION OF THE 
OPTIMAL SCORING FORMULA FOR A HIGHLY-SPEEDED 
TEST UNDER DIFFERENT INSTRUCTIONS 
REGARDING SCORING PENALTIES 


WILLIAM B. MICHAEL 
University of California, Santa Barbara 
AND 
ROGER STEWART, BRUCE DOUGLASS, AND J. H. RAINWATER²


Los Angeles County Civil Service Commission 


Problem 


It was the purpose of the study to determine experimentally
whether there are differential effects upon the level of performance 
of young adult male subjects, on a highly-speeded clerical aptitude 
test of homogeneous two-choice items, relative to the use of a num- 
ber of different scoring formulas, when varied instructions are given 
to examinees concerning the scoring procedure to be employed. 
Specifically, answers were sought to the following questions: 


(1) To what extent are the differences in the mean levels of
performance of examinees significant when four different scoring
procedures are employed relative to three types of directions regarding
penalties associated with so-called guessing?

(2) What are the mean estimates of reliability based on use of
parallel test forms for each of the scoring formulas used under
different instructions concerning scoring of items?

(3) From the pattern of intercorrelations between sets of various
types of formula scores within the same test form and between the
same types of formula scores upon pairs of parallel test forms (each
of which was administered under three different kinds of instruction
regarding scoring procedure), what inferences, if any, can be
made regarding optimal scoring procedure to be employed?

¹ Now with the International Business Machines Corporation, Owego,
New York.

² The authors wish to express their gratitude to numerous staff
members of the Civil Service Department who made valuable
contributions to various phases of this study. The authors are also
grateful for the personal attention and assistance of Mr. Dale Madden,
International Business Machines, Los Angeles, and for the use of the
IBM computing facilities in the Western Data Processing Center,
Graduate School of Business Administration, University of California,
Los Angeles.


Procedure 


The experimental study was conducted in two phases that
embodied somewhat different experimental designs in view of the
existence of certain practical limitations of physical space for test
administration. One purpose of the initial aspect of the project was
to obtain additional evidence concerning the equivalence of three
alternative forms of the test of Speed and Accuracy in Checking
Records, relative to which an affirmative finding would yield a basis
for use of larger samples in a second experiment with only two test
forms that would be expected to yield somewhat more definitive
results.

Subjects. Young male Marine Corps³ recruits, ranging in age
between 17 and 22, were divided on the basis of General Classification
Test scores into two large groups with comparable means and
standard deviations. For the first phase approximately 180
examinees were available for study, whereas in the second phase
approximately 300 recruits constituted the pool of subjects. Relative to
each phase, a different experimental design was employed.

Tests. In the first phase, three forms (A, B, D) of a clerical apti- 
tude test, Speed and Accuracy in Checking Records, were adminis- 
tered, each form consisting of 90 homogeneous true-false items with 
a time limit of only 6 minutes. For the second phase, an additional 
form C of the clerical test was used. 


Experimental Design 1 


In the first phase of the study, two samples, each consisting of 90
examinees, served to make up the experimental group and control
group. In turn, each of these two samples was subdivided into three
groups each of approximately 30 subjects, which may be denoted as
E_I, E_II, and E_III, and C_I, C_II, and C_III. Within each of these
six subgroups, a counterbalanced order of presentation of the three
test forms was developed. For each experimental subgroup a different
set of instructions (three in all) was presented to the examinees prior
to their taking each new form, but for the three control subgroups
the same instruction (no comment concerning scoring method) was
employed for each new form. In the first instruction no comment was
made regarding the scoring procedure to be used in the evaluation
of the test results, although examinees were encouraged to work as
rapidly and accurately as they could. As to the second instruction,
the candidates were exhorted to “work as rapidly as you can for
your score on this test will be based on the number of correct
responses. It will be to your advantage to guess because you will not
be penalized for wrong answers.” That a penalty for so-called
guessing was involved in the third set of instructions was apparent
from the monitor’s statement, “It will be to your disadvantage to guess
because wrong answers will be counted against you. Your score
will be based on the number of right answers minus the number of
wrong answers.” Thus, the instructions to examinees concerning the
scoring procedure to be employed may be summarized as follows:

³ Without the generous cooperation of Brigadier General R. C. Weede,
United States Marine Corps, who made possible the testing of Marine
Corps personnel at the United States Marine Corps Recruit Depot, San
Diego, this study could not have been made.


Instruction 1. No comment concerning penalty for guessing. 
Instruction 2. Encouragement to guess—no scoring penalty. 
Instruction 3. Exhortation not to guess—scoring penalty. 


Scoring formulas. It was decided for both phases of the experi- 
ment that four scoring formulas would be used for each test form 
under each instruction: (1) the number of right answers, R; (2) 
the number of right answers minus one-half the number of wrong 
responses, R—W/2; (3) the number of right answers minus the 
number of wrong ones, R—W; (4) the number of items omitted, Om 
(including number of items skipped and number of items not 
reached). 
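
The four scoring formulas can be written out as a small Python function. This is a minimal sketch for illustration: the function name, the dictionary keys, and the example counts of right and wrong answers are hypothetical, while the 90-item test length is taken from the description of the test forms.

```python
def formula_scores(right, wrong, n_items=90):
    """Return the four scores for one examinee on one test form.

    right   -- number of items answered correctly (R)
    wrong   -- number of items answered incorrectly (W)
    n_items -- items on the form; 90 per form in this study
    """
    if right < 0 or wrong < 0 or right + wrong > n_items:
        raise ValueError("right and wrong must be non-negative and sum to at most n_items")
    omitted = n_items - right - wrong  # skipped plus not reached
    return {
        "R": right,
        "R-W/2": right - wrong / 2,
        "R-W": right - wrong,
        "Om": omitted,
    }


# Hypothetical examinee: 60 right, 10 wrong, 20 omitted or not reached.
scores = formula_scores(60, 10)
print(scores)
```

For two-choice items, R - W is the conventional full correction for guessing, and R - W/2 applies half that penalty.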

Paradigm for experimental design 1. The basic feature of the first
experimental design may be summarized as shown in Paradigm 1.

                              Paradigm 1

Testing        Experimental Groups              Control Groups
Period      E_I       E_II      E_III       C_I       C_II      C_III

  1       Inst. 1   Inst. 1   Inst. 1    Inst. 1   Inst. 1   Inst. 1
          Form A    Form B    Form D     Form A    Form B    Form D
  2       Inst. 2   Inst. 2   Inst. 2    Inst. 1   Inst. 1   Inst. 1
          Form B    Form D    Form A     Form B    Form D    Form A
  3       Inst. 3   Inst. 3   Inst. 3    Inst. 1   Inst. 1   Inst. 1
          Form D    Form A    Form B     Form D    Form A    Form B

Statistical treatment in experimental design 1. From a study of
the paradigm it is apparent that, relative to application of each
scoring formula, comparisons could be made between experimental
and control subgroups for each corresponding test form and that
initial differences in means between respective control and
experimental subgroups could be evaluated and allowed for in subsequent
comparisons associated with the second and third testing periods
(see Table 1). Moreover, similar comparisons for each scoring
formula could be made between the two groups as to the size of
correlation coefficients arising between equivalent forms administered
during period 1 and period 2, period 1 and period 3, and period 2 and
period 3. Such comparisons would permit an assessment of the
magnitude of reliability estimates for different scoring formulas and an
indication of the possible influence of specific scoring instructions
upon reliability estimates from differences in coefficients obtained
between the experimental and control groups. In addition, the
counterbalanced presentation of the tests furnished a basis for
yielding data concerning the comparability of forms. In view of the very
large numbers of possible permutations, for ease of conceptualization,
mean intercorrelations derived from the three experimental
subgroups and from the three control subgroups would be desirable
and justifiable in light of the apparent comparability of forms (see
Tables 2 and 3).

An obvious defect of the experimental design was the confound-
ing of the order of instruction regarding scoring procedure with
practice effects gained during the three periods of testing. Because
of the existence of only two rooms, it was not possible to counter-
balance the order of instructions. Such a procedure would have
necessitated six examining rooms. Additional rooms were not avail-
able at the time of testing. It is quite possible that a reversal of the
order of instructions 2 and 3 especially would have led to somewhat
different effects in both net changes in mean scores and in estimates
of alternate form reliability. In view of such a possibility, a second
experiment embodying permutation of both test forms and order
of instructions was conducted in which, with the use of larger groups,
it was anticipated that somewhat more stable estimates of reliability
could be obtained.


Experimental Design 2 


A somewhat simpler design was employed in the second phase of
the study in that the assumption of the equivalence of four forms
of the same clerical test permitted the study of possible differential
effects associated with presentation of instruction 2 followed by
instruction 3 and of instruction 3 followed by instruction 2. Because
of the necessity of administering tests representing several different
abilities in conjunction with a standardization project, an experi-
mental design was employed that allowed for intervening practice
on two other types of tests between the initial administration of
form A to all examinees without comment (instruction 1) regarding
scoring procedure to be followed, and the subsequent administration
of forms B and C to one group of 140 subjects under instructions
2 and 3, respectively, and of forms C and B to a second group
of 156 examinees under instructions 3 and 2, respectively, during
experimental testing periods 2 and 3. The test form D was admin-
istered to the two groups during a final fourth period of testing,
without comment regarding scoring procedure (instruction 1), in
order that possible effects of intervening instructions regarding
guessing might be evaluated. The experimental design is sum-
marized in Paradigm II.

                         Paradigm II

Testing
Periods      Group I (N = 140)      Group II (N = 156)

   1         Inst. 1                Inst. 1
             Form A                 Form A
   2         Inst. 2                Inst. 3
             Form B                 Form C
   3         Inst. 3                Inst. 2
             Form C                 Form B
   4         Inst. 1 (4)            Inst. 1 (4)
             Form D                 Form D

Statistical treatment in experimental design 2. Statistical proce- 
dures comparable to those used in the first experimental design were 
applied to the data derived from the second experiment. In addition 
to the determination of the possible significance of differences be- 
tween mean performance under different instructions regarding the 
practicality of guessing, comparisons of correlation coefficients re- 
flecting the degree of alternate form reliability were made. 
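
As an illustration of the alternate-form reliability estimates referred to here, the sketch below computes a product-moment correlation between scores on two forms taken by the same examinees. The function name and the six pairs of scores are invented for the example; the article does not report raw scores or describe its computing routine.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical R scores of six examinees on two parallel forms.
form_a = [40, 52, 47, 61, 35, 58]
form_b = [44, 55, 45, 66, 38, 57]
reliability = pearson_r(form_a, form_b)
print(round(reliability, 2))
```

The same function would be applied separately to each scoring formula (R, R-W/2, R-W, and omits) so that the resulting coefficients can be compared across formulas.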


Findings Relative to the First Phase (Based on Experimental
Design 1)

From a study of the net changes in mean scores for pairs of ex-
perimental and control groups cited in Table 1, the findings may be
summarized as follows:


(1) It is apparent that, for the three formula scores R, R—W, 
and R—W/2, all nine shifts in average score associated with the 
time interval between instruction 1 and instruction 2 were positive 
(four of them being statistically significant). However, it cannot be 
determined whether the mean gains should be attributed to practice 
effects with item content or to a change in mental set in conjunction 
with the second instruction encouraging the examinees to guess. 

(2) With respect to the same three scoring formulas, the mean 
increments in performance arising between instruction 1 (no com- 
ment about guessing) and instruction 3 (warning not to guess) for 
each of the nine comparisons were in general both somewhat smaller 
than those mentioned relative to instruction 1 and instruction 2, ex- 
cept for the third subset of experimental and control groups, and 
lacking in statistical significance. Thus, it would seem that instruc- 
tion 3 against guessing tended to exert a somewhat depressing in- 
fluence upon performance despite the probable presence of addi- 
tional practice effects associated with the third exposure to test 
material. 

(3) That, for the same three formula scores, negative mean changes
arose in six out of nine comparisons between instruction 2 and
instruction 3, of which three were statistically reliable (the
slight positive increments arising for the third pair of experimental
and control subgroups), would suggest that the advice against guess-
ing in the third instruction did tend to depress somewhat the mean
level of performance.


TABLE 1

[Net changes in mean test scores, for four kinds of formula scores, of
the three pairs of experimental and control subgroups on three parallel
test forms administered under three instructions regarding scoring
procedure. The body of the table is illegible in the source scan. Each
entry is the change in mean score between two test forms for an
experimental subgroup minus the corresponding change for its control
subgroup. The instructions compared are (1) no comment regarding
advisability of guessing, (2) encouragement to guess (no penalty), and
(3) exhortation not to guess (penalty).]

(4) Relative to each of the same three scoring formulas, there ap- 
peared to be no noticeable differential in the relative magnitude of 
the shifts in the net mean changes. Thus, use of any one of the three
formula scores would be expected to yield similar results in terms of 
the amount and direction of the mean changes in performance sub- 
sequent to modifications in instructions concerning the advisability 
to guess. 

(5) As to the scoring procedures in which numbers of omitted 
responses were determined, it was apparent that (a) the net change 
in mean number of omitted items decreased significantly in two out 
of three comparisons when instruction 2 (advice in favor of guess- 
ing) was presented subsequent to instruction 1, although the decre- 
ment could also be related to practice effects (i.e., to more items at- 
tempted and thus fewer items omitted); (b) the net shift in mean 
number of omitted items under the third instruction (advice against 
guessing) as compared with the second instruction (advice in favor 
of guessing) was positive in each of the three comparisons and
statistically significant for the first pair of experimental and control 
groups—a finding that would reflect a change in mental set despite 
any influence of familiarity and experience with similar items; and 
(c) the net decrement in the mean number of omitted responses 
when performance under the third instruction (exhortation against 
guessing) was compared with the initial instruction of no comment, 
although only slight in its magnitude, would indicate that the third 
instruction served to inhibit performance in view of what would be 
anticipated to be substantial practice effects arising from two prev- 
ious testing sessions. 

As to changes for the three experimental and three control sub- 
groups in the mean intercorrelations obtained between pairs of four 
different scoring procedures both within the same test form and be- 
tween various test forms that had been administered under three in- 
structions regarding advisability to guess, consideration of the en- 
tries in Table 2 points to the following limited findings: 

(1) Under the same type of instruction regarding the feasibility
of guessing, the mean correlations for the three experimental sub-
groups between different formula scores on various pairs of test
forms were found to be higher than when correlations between
sets of different formula scores on comparable pairs of forms were
obtained, when one test form was administered under one instruc-
tion about guessing and the second form under a different instruc-
tion. However, in view of comparable results for the control sub-
groups relative to each of the three testing periods with respect to
which the same instruction (of no comment concerning guessing)
was followed, it cannot be concluded that the occurrence of lower
correlation coefficients associated with performance measured dur-
ing different testing periods is a function of the directions com-
municated to the examinees.


TABLE 2

[Mean intercorrelations obtained between sets of four kinds of formula
scores within the same test form and between various sets of formula
scores on three parallel test forms, for three control groups (entries
above diagonal) and for three experimental groups (entries below
diagonal). The body of the table is illegible in the source scan. The
designations for type of instruction to examinees are 1 (NC), no
comment concerning guessing; 2 (G), encouragement to guess; and
3 (NG), warning not to guess.]

(2) As would be expected, the omit scores showed substantial 
negative correlations with the three types of formula scores cited 
(R, R—W/2, and R—W) in which the number of correct responses 
is involved. Specifically, the greater the degree of relative impor- 
tance of the number of correct answers in the scoring formulas, the 
higher numerically was the size of the negative coefficient. 

(3) The degree of correlation of formula score R, which yielded the 
number of right answers, with formula scores R, R—W/2, and 
R—W tended to show respectively less relationship the greater the 
size of the negative weight assigned to wrong answers—that is, R 
scores correlated most highly with other R scores, next with R—W/2 
scores, and lowest with R—W scores. 

As to the mean estimates of alternate form reliability obtained 
when the same scoring formula was used for corresponding sets of 
test scores from the various pairs of test forms, relative to different
instructions about guessing for the experimental subgroups and to 
the same instruction of no comment for the control subgroups given 
at the beginning of each of the three corresponding test periods, the 
following three results based on the data presented in Table 3 
seemed noteworthy:

(1) The highest degree of alternate form reliability was realized 
when the number of omits was correlated between pairs of forms, 
although such an estimate of reliability would probably be propor- 
tional to the lengths of the test for a fixed time limit in that in any 
long examination an administration under short time limits would 
be expected to yield higher reliabilities for scores based on items 
omitted. 

(2) In terms of the scoring formulas R, R-W/2, and R-W,
the trend was in the same direction for all six comparisons made
relative to the three experimental and the three control subgroups,
as evidenced by the fact that the scoring formula R consistently
yielded the highest estimates of equivalent form reliability followed
in order by the estimates resulting from use of R-W/2 and then by
R-W (irrespective of the pairs of periods of test-taking behavior
that were compared for the experimental subgroups).


TABLE 3

Mean Alternate-Form Estimates of Reliability Derived from Three Parallel
Tests that were Scored in Four Different Ways and Administered to
Three Experimental Groups Under Three Kinds of Instruction
Regarding Scoring Procedure and to Three Control Groups
During Three Corresponding Periods of Time Under
a Single Instruction (of No Comment) Concerning
Scoring Procedure to be Followed

Experimental Groups: means of three correlations between pairs of test
forms each administered under different instructions.*
Control Groups: means of three correlations between pairs of test forms
each administered under the same instruction of no comment during each
of three testing periods: P1, P2, and P3.

             Experimental Groups             Control Groups
Scoring      (2) vs.  (3) vs.  (3) vs.       P2 vs.  P3 vs.  P3 vs.
Formula        (1)      (1)      (2)           P1      P1      P2

R          [illegible] [illegible] .80         .68     .78     .67
R-W/2          .70      .73      .74           .55     .75     .54
R-W            .67      .69      .70           .46     .72     .47
Omits          .83      .81      .86           .82     .81     .83

* The instructions have been previously described in Table 1.

(3) Estimates of alternate form reliability for the control sub-
groups were highest irrespective of the scoring formula used involv-
ing a function of R (i.e., R, R-W/2, or R-W) when scores earned
during the third period of testing were compared with those received
during the first testing period. Estimates of reliability for scores
made during the third period of testing in relation to those obtained
during the second session, and for scores received during the second
period in relation to those achieved during the first session, were
lower than those estimates based upon the third and first testing
periods, but almost the same in magnitude relative to each scoring
formula in turn. Estimates of reliability for the three test forms
based on the experimental subgroups (receiving three different in-
structions) were somewhat higher and less variable than those de-
rived from the control subgroups (receiving one standard instruc-
tion three times). Specifically, estimates based upon coefficients of
correlation between scores obtained during the second and first
periods and during the third and second periods were noticeably
higher for the experimental groups than for the control groups. How-
ever, with respect to the first and third testing periods the two sets
of reliability estimates were almost identical.


Findings for the Second Phase (Based on Experimental Design 2) 


From a study of the changes in mean scores for Groups I and II 
reported in Tables 4 and 5, respectively, the findings may be sum- 
marized as follows: 


TABLE 4

Mean Differences Between Scores on Pairs of Test Forms and
their Statistical Significance for Group I (N = 140)

           Corresponding             Scoring Formula
Pairs      Instructions                                        Omits
of Forms   Comparedᵃ          R        R-W/2      R-W          (R+W)

B-A        2 vs. 1          9.53**    7.26**     4.99*       -14.08**
C-A        3 vs. 1          6.09**    5.04**     3.98**       -8.23**
D-A        4 vs. 1          6.83**    4.33**     2.02*       -11.66**
C-B        3 vs. 2         -3.44**   -2.22*     -1.01          5.85**
D-B        4 vs. 2         -2.70**   -2.93**    -2.97**        2.44**
D-C        4 vs. 3           .74      -.71      -1.96*        -3.43**

ᵃ Instructions 1 and 4 are exactly the same in content. Instruction 1
preceded instructions 2 and 3, and instruction 4 came at the beginning
of the fourth and final testing period. Instruction 1 was given during
the beginning of testing period 1; instruction 2 at onset of testing
period 2; instruction 3 at start of testing period 3; and finally
instruction 4 at beginning of testing period 4.
* Value of t significant at .05 level.
** Value of t significant at .01 level.


TABLE 5

Mean Differences Between Scores on Pairs of Test Forms and
their Statistical Significance for Group II (N = 156)

           Corresponding             Scoring Formula
Pairs      Instructions
of Forms   Compared           R        R-W/2      R-W          Omits

C-A        3 vs. 1          4.04**    4.16**     4.43**       -3.65**
B-A        2 vs. 1         11.74**    8.49**     5.35**      -18.14**
D-A        4 vs. 1          9.38**    6.45**     3.60**      -15.03**
B-C        2 vs. 3          7.70**    4.33**      .92        -14.49**
D-C        4 vs. 3          5.34**    2.29*      -.83        -11.38**
D-B        4 vs. 2         -2.36**   -2.04*     -1.75**        3.11**

See footnote of previous table regarding instructions 1 and 4. Instructions 1,
3, 2, and 4 (not 1, 2, 3, 4) were given at the beginnings of testing periods 1, 2, 3,
and 4, respectively. Note that the order of instructions 2 and 3 is reversed from
that in the previous table.
* Value of t significant at .05 level.
** Value of t significant at .01 level.
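
Because every row of these tables is a difference between two period means, the rows are tied together arithmetically: a difference such as C-B should equal (C-A) minus (B-A). The sketch below checks this for Group I using the three "versus Form A" rows of Table 4. It assumes the tabled entries are simple differences of means over the same 140 examinees; the dictionary and function names are illustrative.

```python
# Mean differences relative to Form A for Group I (Table 4),
# in the column order R, R-W/2, R-W, Omits.
vs_form_a = {
    "B": (9.53, 7.26, 4.99, -14.08),
    "C": (6.09, 5.04, 3.98, -8.23),
    "D": (6.83, 4.33, 2.02, -11.66),
}


def implied_difference(later, earlier):
    """Difference (later - earlier) implied by the two differences from Form A."""
    return tuple(
        round(x - y, 2) for x, y in zip(vs_form_a[later], vs_form_a[earlier])
    )


for later, earlier in [("C", "B"), ("D", "B"), ("D", "C")]:
    print(f"{later}-{earlier}", implied_difference(later, earlier))
```

The implied values agree with the tabled C-B, D-B, and D-C rows to within about .02, the size of discrepancy that rounding of the printed means would produce. The same check is a convenient way to catch a misprinted or misread entry.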




(1) When, in terms of a function of the number of correct an- 
swers (R, R—W/2, and R—W), the mean performance of either 
group during testing periods 2, 3, and 4 is compared with that during 
the first testing period, there is evidence of significant increments 
in mean score that may be interpreted as reflecting practice effect 
(despite the apparent temporary influence upon performance im-
posed by the direction not to guess, instruction 3).

(2) That the third instruction which urged examinees not to guess 
exerted a depressing effect upon scores is apparent for Group I from 
an inspection of the magnitude of the difference in mean scores be- 
tween testing period 2 and 3 in which, despite the presence of prob- 
able practice effects, statistically significant average decrements of 
3.44 and 2.22 for formulas R and R—W/2 and a noticeable but sta- 
tistically unreliable decrement of 1.01 for the formula R—W arose. 

(3) That the second instruction encouraging guessing, given sub- 
sequent to the third instruction advising against guessing, may have 
exerted a facilitating effect for Group II is apparent from the sta- 
tistically reliable mean increments of 7.70 and 4.33 associated with 
the scoring formulas R and R-W/2 and of the nonsignificant in-
crement of .92 relative to the scoring formula R-W, although ad-
mittedly practice effects could have contributed an indeterminate
amount to these average gains. 

(4) In general for both groups the size of increments or decre- 
ments in mean scores from one testing period to another was in- 
versely proportional to the amount of deduction for wrong answers 
—that is, the less deducted for wrong answers the greater were the 
mean differences associated with various pairings of testing intervals. 

(5) In those instances for which increments and decrements in
formula scores were obtained for both Groups I and II relative to
various pairings of testing periods, conversely decrements and in-
crements, respectively, in omit scores were found, as one would
expect. This fact obviously indicates that a reduction in the number
of marked responses scored as being correct is associated with a
gain in the number of omitted items.

Despite the introduction of conflicting mental sets by different 
orders of instruction and the probable interaction of these sets with 
practice effects, two findings (not reported in tables) involving
both gross and net positive and negative changes in mean test scores 
for the two groups seem to be noteworthy: 




(1) With respect to Form D of the test administration given dur-
ing the fourth and final testing period in which no instructional com-
ment was made concerning guessing, statistically significant differ-
ences in mean standings of 5.63, 4.82, and 3.79 were found relative
to the scoring formulas R, R-W/2, and R-W in favor of Group II,
which received instruction 3 (advice against guessing) prior to in-
struction 2 (advice to guess).

(2) The net mean changes in test score of Group II between the
first and fourth period minus the corresponding net mean change
for Group I relative to the scoring formulas R, R-W/2, and R-W
yielded values of 2.55, 2.12, and 1.58 (only the first value being sta-
tistically significant), a finding indicating a slight advantage in
net mean gain for the group receiving instruction 3 (exhortation
not to guess) before instruction 2 (encouragement to guess).
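
The first two of these values can be reproduced from the D-A rows of Tables 4 and 5. The short sketch below does so for R and R-W/2; the variable names are illustrative, and it assumes that the "net mean change between the first and fourth period" is the tabled D-A difference for each group.

```python
# D-A mean differences (fourth period minus first period).
group_1_d_minus_a = {"R": 6.83, "R-W/2": 4.33}  # Table 4, Group I
group_2_d_minus_a = {"R": 9.38, "R-W/2": 6.45}  # Table 5, Group II

# Advantage of Group II (instruction 3 before 2) over Group I (2 before 3).
net_advantage = {
    formula: round(group_2_d_minus_a[formula] - group_1_d_minus_a[formula], 2)
    for formula in group_1_d_minus_a
}
print(net_advantage)
```

The differences come out at 2.55 for R and 2.12 for R-W/2, matching the values quoted in the text.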

From the results furnished through use of the second experimental
design, it may be concluded that instructions rendered concerning
the feasibility of guessing or not guessing were associated in many
instances with statistically reliable increments or decrements in
average test scores.

In Table 6 are furnished for the two groups all possible combina-
tions of alternate-form estimates of reliability representing the de-
gree of correlation between different forms of the test that were
administered during four periods of testing carrying varied scoring
instructions (although the first and fourth periods had no comment
concerning the scoring formula).


TABLE 6

Alternate-Form Estimates of Reliability Derived from Four Parallel Tests that
were Scored in Four Different Ways and Administered to Two Experimental
Groups Under Three Kinds of Instruction Regarding Scoring Procedure

                               Combinations of Instruction Concerning
                                  Scoring Penalty for Guessingᵃ
                  Scoring
Group             Formula   1 vs. 4  1 vs. 3  1 vs. 2  2 vs. 3  2 vs. 4  3 vs. 4

I (N = 140)       R           .80      .75      .80      .78      .81   [illegible]
(Receiving        R-W/2       .72      .68      .76      .71      .74   [illegible]
Instructions      R-W         .61      .59      .70      .63      .68   [illegible]
in 1234 Order)    Omits       .75      .77      .76      .74      .78   [illegible]

II (N = 156)      R           .64      .70  [illegible]  .78      .70   [illegible]
(Receiving        R-W/2       .64      .70      .73      .69      .68   [illegible]
Instructions      R-W         .60      .68      .67      .62      .67   [illegible]
in 1324 Order)    Omits       .51      .64      .68      .59      .78   [illegible]

ᵃ See footnotes in Tables 4 and 5 concerning the equivalence of instructions
1 and 4 (no comment) and the text for the nature of the scoring instructions
presented.


Despite the fact that the corresponding reliability coefficients
for the second group are usually somewhat lower than those for the
first group, the same finding as in the first experiment was obtained
concerning the magnitude of the reliability estimates as being
inversely related to the size of correction for wrong answers. In
other words, the reliability estimates were highest when the items
were simply scored R, next highest for R-W/2, and lowest consistently
when the formula R-W was employed. Additional findings in Table 6
may be summarized as follows:

(1) For Group I the reliability estimates of omit scores were al-
most as high as those for R scores, but for Group II estimates were
about equal to those for R-W scores.

(2) Within the same scoring formula for each group the estimates 
of reliability were highly comparable irrespective of the particular 
pairing of different scoring instructions or extent of temporal separa- 
tion between testing periods in which the scoring instructions were 
presented. In particular the insertion of two hours of testing of other 
psychological functions between the first and second testing periods 
was not associated with any noticeable differential in the magnitudes 
of the reliabilities found for pairs of testing sessions, irrespective 
of whether the earlier or later testing period was or was not sepa-
rated by the two hour period of other test activity.


Summary 


In an experimental investigation involving two separate studies
based upon slightly different experimental designs for two groups
of 180 and 296 Marine Corps recruits, an attempt was made to
determine the optimal scoring formula (such as number of items
right R, the number right minus one-half the number wrong R-W/2,
or the number right minus the number wrong R-W) for a highly
speeded test of clerical aptitude containing two-choice items in the
light of three different instructions to the examinees concerning the
advisability of guessing: (1) a direction in which no comment was
made regarding the feasibility of guessing, (2) an instruction in
which the candidate was encouraged to guess since there would be
no penalty, and (3) a statement advising the examinee not to guess
since there would be a penalty. The major findings may be sum-
marized as follows:


(1) Although substantial practice effects were noted from the
results furnished by the subgroups studied, it was apparent, as one
would expect, that irrespective of the scoring formula employed in-
volving a function of R (i.e., R, R-W/2, or R-W), a net incre-
ment in average level of performance occurred subsequent to the
instruction in which guessing was encouraged and that a net decre-
ment in average level of performance took place following the in-
struction in which advice against guessing was given. Such changes
in performance would be substantial enough to necessitate the use
of different sets of norms depending upon the instructions rendered.

(2) When estimates of alternate form reliability were obtained, the
highest degree of reliability was associated in one study with the
number of items omitted, although such a finding is extremely de-
pendent upon the time limit for the test and the number of items.

(3) When a scoring formula embodying a function of R was em-
ployed, the results in both studies showed a consistent trend for the
alternate form estimate of reliability to be highest in terms simply
of scoring the number of items correct R, next highest for using the
formula R-W/2, and lowest for employing the formula R-W.


Conclusions 


From the results obtained, it would appear that for the test in-
vestigated, which would seem to be fairly representative of other
speeded tests of clerical aptitude, the scoring of the number of items
correct without penalty for wrong answers would be the preferred
procedure. However, since in isolated cases a sophisticated candidate
would be inclined to mark all remaining items in a test form during
the last 30 to 40 seconds of the testing period and thus to inflate his
score, it is suggested that in the directions of speeded tests of homoge-
neous material specific mention of a penalty for guessing be cited.
Although ideally one might then proceed to score the papers in terms
of simply the number of items correct, there would seem to be both
a moral and a legal obligation (especially in civil service testing) to
score the papers with a formula embodying a penalty for wrong
answers, if the candidate had been advised of an intended applica-
tion of a penalty factor. Thus, in order that the reliability of the
scores may be preserved at a relatively high level for the two-choice
items, it would be recommended from the findings of this investiga-
tion that the scoring formula R-W/2, or perhaps even more ad-
vantageously the scoring formulas R-W/3 or R-W/4, be applied.
Such a procedure would be expected to preserve the reliability of
the instrument, to suppress the response set or tendency of some
examinees to guess wildly, and to keep faith with the candidates
who had been advised that a scoring penalty would be applied. Of
course, replication of this experimental study both with other samples
of examinees and with other speeded clerical aptitude tests consist-
ing of homogeneous items embodying two alternatives would be
highly desirable in view of the possible existence of substantial
differences in characteristics of motivation and ability in various
populations of job applicants.
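
The concern about a candidate who blindly marks all remaining items can be made concrete with a small expected-value sketch. It assumes, as the article does not state, that each blindly marked two-choice item is right with probability one-half; the function name and the figure of 20 leftover items are illustrative only.

```python
def expected_gain_from_blind_marking(n_items, penalty, n_choices=2):
    """Expected score gain from marking n_items at random.

    penalty is the fraction of a point deducted per wrong answer
    (0 for R, 0.5 for R-W/2, 1 for R-W, and so on).
    """
    p_right = 1 / n_choices
    expected_right = n_items * p_right
    expected_wrong = n_items * (1 - p_right)
    return expected_right - penalty * expected_wrong


# Expected gain from blindly marking 20 leftover two-choice items.
formulas = [("R", 0), ("R-W/4", 1 / 4), ("R-W/3", 1 / 3), ("R-W/2", 1 / 2), ("R-W", 1)]
gains = {name: expected_gain_from_blind_marking(20, p) for name, p in formulas}
for name, gain in gains.items():
    print(f"{name:6s} {gain:5.2f}")
```

Under these assumptions only R-W removes the expected gain from blind marking entirely, while R-W/2 halves it. The lighter penalties recommended above therefore reduce, but do not eliminate, the incentive to mark leftover items, in exchange for the higher reliability reported for them.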




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


SELF, INTERACTION, AND TASK ORIENTATION 
INVENTORY SCORES ASSOCIATED WITH OVERT 
BEHAVIOR AND PERSONAL FACTORS¹


BERNARD M. BASS 
University of Pittsburgh 


GEORGE DUNTEMAN 
University of Rochester 


ROLAND FRYE, ROBERT VIDULICH, AND HELEN WAMBACH
Louisiana State University 


A person's behavior in a group, particularly whether and when he
attempts to lead it, is strongly affected by whether he is self-ori-
ented, interaction-oriented, or task-oriented (Bass, 1960). This
position is supported by the work of McClelland, Atkinson, et al.
(1953); French (1956); and Fouriezos, Hutt and Guetzkow (1950).

Task-oriented members are described as most attracted to a group
by expectations of task success and its rewards. They are individuals
who would be reinforced primarily by task effectiveness. They are
likely to be concerned about getting the job done, solving the group’s
external problems, and working with persistence on the barriers pre-
venting the group from obtaining success in its endeavors.

Interaction-oriented members are described as reaping rewards
from the satisfactions of the interaction with others. They are likely
to be less concerned about getting the job done and about striving
to succeed in solving the group's external problems. Merely
maintaining harmonious, conflict-free relationships with others is
most satisfying to the interaction-oriented.

Self-oriented members are described as attracted to groups in the
expectation of direct reward to themselves regardless of the task or
interaction effectiveness of the group. The group is "merely the
theatre in which certain generalized needs can be satisfied. The
other members are both the remainder of the cast as well as an
audience for which the self-oriented member can air his personal
difficulties, gain esteem or status, aggress or dominate” (Bass, 1960,
pp. 149-150).


1 This work was supported by Contract N7 onr 35609 with the Group Psy-
chology Branch, Office of Naval Research.

This report will describe the Orientation Inventory (Ori) de-
signed to assess the self, interaction, and task-orientation of ex-
aminees.² Then, several validity analyses will be presented showing
the relations of orientation scores to overt behavior and various
personal factors.


Development of the Inventory 


Triads of statements were assembled, containing one statement 
deemed likely to be accepted most by the self-oriented examinee, 
one statement deemed likely to be accepted most by the interaction- 
oriented examinee, and one statement deemed likely to be accepted 
most by the task-oriented examinee. The triads dealt with the value
of groups, personal needs, and projection of one’s values. A typical
triad was:


I ADMIRE A MAN WHO: 
A. Inspires others by his personality 
B. Knows how to get everyone working together for the common good
C. Makes a contribution to human happiness and progress


Examinees were asked to indicate which alternative of each triad 
they agreed with most and which they agreed with least. If a
presumably task-oriented alternative was selected as most agreeable,
the respondent earned +2 on the task-orientation scale; if it was
marked as least acceptable, he added nothing to his task-orientation
score. If the alternative regarded as task-oriented was neither
accepted nor rejected, the examinee added +1 to his task-orientation
score. The self-oriented and interaction-oriented alternatives were 
scored in the same way to contribute to the examinee’s self-orienta- 
tion and interaction-orientation totals. 
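
The scoring rule just described can be sketched in a few lines of Python. This is only an illustration: the key that assigns each alternative to a scale (`EXAMPLE_KEY`) and the function name `score_triads` are hypothetical, since the actual key belongs to the Ori manual and is not given here.

```python
from collections import Counter

# Hypothetical key: the scale each alternative of a triad is presumed to tap.
# The real Ori key is not reproduced in the article; this one is made up.
EXAMPLE_KEY = [
    {"A": "self", "B": "interaction", "C": "task"},
    {"A": "task", "B": "self", "C": "interaction"},
]

def score_triads(key, responses):
    """Return scale totals from a list of (most, least) choices, one per triad."""
    totals = Counter({"self": 0, "interaction": 0, "task": 0})
    for triad_key, (most, least) in zip(key, responses):
        if most == least:
            raise ValueError("most and least choices must differ")
        for alternative, scale in triad_key.items():
            if alternative == most:
                totals[scale] += 2   # chosen as most agreeable
            elif alternative != least:
                totals[scale] += 1   # neither accepted nor rejected
            # the alternative marked least adds nothing
    return dict(totals)

scores = score_triads(EXAMPLE_KEY, [("C", "A"), ("A", "B")])
print(scores)  # {'self': 0, 'interaction': 2, 'task': 4}
```

Under this rule every triad distributes exactly three points among the three scales, so on a 27-triad form the three totals of one examinee always sum to 81. A high score on one scale therefore forces lower scores on the others.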


2 For further details, see the manual for the Orientation Inventory; Palo 
Alto: Consulting Psychologists Press, 1962. 




Reliability of Scales and Stability of Classification 


After three revisions following independent internal consistency 
item analyses, 27 triads formed the final revision with estimated 
odd-even reliabilities of .50 for self-orientation, .70 for interaction- 
orientation and .64 for task-orientation (N = 100). 

It was guessed that the scales were not factorially pure, each 
measuring more than one factor. Therefore, internal consistency 
estimates would be underestimates of the true reliability of the 
three scales, s, i and t; consequently, an actual measure was
obtained of the test-retest reliability for each scale. Eighty-four
examinees, college students, received two administrations of the final
form of Ori, a week apart. Test-retest correlations were: self, .73; 
interaction, .76; and task, .75. 
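
As a rough sketch of the two kinds of estimate mentioned here: a test-retest coefficient is the Pearson correlation between the two administrations, and an odd-even estimate is commonly obtained by correlating the two half-test scores and stepping the result up with the Spearman-Brown formula. Whether the Spearman-Brown step was applied in this study is an assumption; the half-test scores below are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * r_half / (1 + r_half)

# Invented half-test scores for six examinees (odd items, even items).
odd_half = [10, 12, 9, 14, 11, 13]
even_half = [11, 10, 10, 13, 12, 12]

r_half = pearson_r(odd_half, even_half)
print(round(r_half, 2), round(spearman_brown(r_half), 2))
```

The same `pearson_r` function applied to first-administration and second-administration totals gives a test-retest coefficient of the kind reported above.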

Although the obtained reliabilities suggest the scales are less than 
adequate for diagnostic work, it will be seen that they are satisfac- 
tory for isolating three idealized types of individuals. Because of 
the scoring procedures, it is possible to clearly isolate individuals 
who are high in only one of the three scales and either intermediate 
or low on the other two. In this process of separating out individuals, 
there are about 20 per cent who do not emerge as high on any of 
these scales. These provide residual groups to serve as controls. 

The major use planned of Ori was to type people, either as self, 
interaction or task-oriented, and then to contrast the behavior under 
controlled conditions of people so typed. The ultimate reliability 
question therefore was how much error lay in typing an individual 
in one category or failing to type him in that category. 

The 84 subjects for whom test and retest scores were available 
were typed, based on the first Ori administration, in a manner which
was to be followed in subsequent experiments. First, individuals in
the top quartile of each of the three scales were identified. Since the
scales are negatively correlated, with few exceptions the three
clusters of individuals did not contain any of the same people. Those
5 per cent who scored in the top 25 per cent on two scales were
placed in a residual category, along with 20 per cent of individuals
who failed to emerge in the top quartile on any of the three scales.
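
The typing procedure can be sketched as follows. The function names, the handling of tied scores at the cutoff, and the eight sample records are assumptions made for illustration; only the rule itself (top quartile on exactly one scale, otherwise residual) comes from the text.

```python
SCALES = ("self", "interaction", "task")

def top_quartile_cutoff(values):
    """Lowest score that still falls among the top quarter of the values."""
    ordered = sorted(values, reverse=True)
    k = max(1, len(ordered) // 4)
    return ordered[k - 1]

def type_examinees(records):
    """Label each record by the one scale on which it is in the top quartile.

    Records high on two scales, or on none, are labeled 'residual'.
    Ties at the cutoff count as high (an assumption of this sketch).
    """
    cutoffs = {s: top_quartile_cutoff([r[s] for r in records]) for s in SCALES}
    labels = []
    for r in records:
        high = [s for s in SCALES if r[s] >= cutoffs[s]]
        labels.append(high[0] if len(high) == 1 else "residual")
    return labels

# Invented scores; each record sums to 81, as on a 27-triad form.
sample = [
    {"self": 35, "interaction": 22, "task": 24},
    {"self": 33, "interaction": 24, "task": 24},
    {"self": 20, "interaction": 36, "task": 25},
    {"self": 21, "interaction": 34, "task": 26},
    {"self": 19, "interaction": 23, "task": 39},
    {"self": 20, "interaction": 24, "task": 37},
    {"self": 27, "interaction": 27, "task": 27},
    {"self": 26, "interaction": 28, "task": 27},
]
labels = type_examinees(sample)
print(labels)
```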

Perfect stability of classification would have been achieved if
100 per cent of all subjects of every type had appeared in the
diagonal cells of Table 1. As can be seen, from 60 to 70 per cent of
subjects typed s, i, or t on the first administration were in the
diagonals; that is, in the same s, i or t category on the second
administration.


TABLE 1

Classification and Reclassification a Week Later of 84 College Students
into Self, Interaction and Task-Oriented Types

Classification              Classification According to the Second Administration
According to the First      self     interaction   task     residual   All
Administration              (%)      (%)           (%)      (%)        (%)

self                        70.6      0            11.8     17.6       100
interaction                  4.8     61.9           4.8     28.6       100
task                         4.8      0            66.6     28.6       100
residual                    24.0     20.0          20.0     36.0       100

χ² = 65.9; p < .01

Most of the instability of classification occurring for the remain- 
ing subjects, originally typed s, i or t, was in their tendency to move 
into the non-specific residual class on the second administration. 
Only 6.5 per cent of the 84 subjects originally typed as s, i or t found
their way into a different s, i or t category on the second
administration. No persons typed as either self or task-oriented on the first
administration were found in the interaction category on the second 
administration. 

Much less stability was shown by those in the residual category 
initially. Only 36 per cent of these initially residual examinees were
residual on the second administration; the others found their way
in equal proportions into s, i or t classifications.

Viewed tactically, the likelihood of classification instability
appears mainly in the movement of subjects into and out of the
residual category. This means that if, for the conducting of an
experiment, we identify a subject as self-oriented,
interaction-oriented, or task-oriented, he probably would either be so
identified again in the future or he would be sloughed off from
experimental consideration into the residual group. On the other hand,
the majority of those we slough off into the residual cluster
initially, and fail to use in our experiment, are likely to turn up
subsequently in an s, i or t category. This means we are mainly
sacrificing a considerable number of potential s, i or t subjects.
Given an abundant supply of subjects, this major portion of
classification instability is of little consequence.




Associations with Overt Behavior: Task Completion 


A week after taking Ori, 68 subjects were told in class to work as 
rapidly as possible on a scrambled words task requiring an average 
of 10 minutes to complete. After five minutes, subjects were told 
that they should check the word on which they were working when 
time was called. They also were told that if they liked they could 
complete the problem, although their score was based on how far 
they had gone before time was called. A point-biserial correlation 
of .47, significant at the 1 per cent level, was found when task- 
orientation scores of “completers” were compared with those who 
quit. 
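
For readers unfamiliar with the statistic, a point-biserial correlation relates a continuous score to a two-category outcome, here task-orientation score against completing versus quitting. The sketch below uses the standard formula with the population standard deviation; the four scores and group flags are invented and are not the study's data.

```python
import math

def point_biserial(scores, group):
    """Point-biserial r between continuous scores and 0/1 group flags."""
    n = len(scores)
    ones = [s for s, g in zip(scores, group) if g == 1]
    zeros = [s for s, g in zip(scores, group) if g == 0]
    p = len(ones) / n          # proportion in the group coded 1
    q = 1 - p
    mean_all = sum(scores) / n
    # Population standard deviation of all scores (divide by n, not n - 1).
    sd = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    mean_diff = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return mean_diff / sd * math.sqrt(p * q)

# Invented example: 1 = completed the task, 0 = quit.
r_pb = point_biserial([1, 2, 3, 4], [0, 0, 1, 1])
print(round(r_pb, 3))  # 0.894
```

The value is identical to the ordinary product-moment correlation computed with the group coded 0 and 1.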

Similar results were obtained when this study was repeated with 
children by Frye and Osborn (in press). A week after completing a 
child’s version of Ori, 62 fourth graders were asked to work on an 
arithmetic examination of 30 minutes; the children were told that 
when the recess bell rang they could either turn their papers in or 
stay during the recess period and check or redo their work. Children 
with high task scores were found most likely to remain rather than 
go outside and play.


Working Alone or in Groups 


A second “Scrambled Words” task was given out to the class of 
68, ostensibly to see whether two people working together did better 
than those working alone. Subjects were asked to elect whether to 
work together or to work alone. The point-biserial correlation
comparing the interaction-orientation scores of those who elected to
work together and those who wished to work alone was .27,
significant at the 5 per cent level.


Overt Volunteering Behavior 


In another attempt at examining differences in overt decisions by
subjects of supposedly differing orientation, a request for student
volunteers was made by several elementary psychology class
instructors. The classes contained a total of 168 students for whom
Ori scores were available. No particular incentives such as extra
class credit were offered for volunteering. Volunteers were requested
for “research activities” in one of two areas:




1. group discussion: subjects will participate in a 50 minute
informal group discussion involving selected topics
or 
2. solving problems alone: each subject individually will work 
at solutions to selected problems. 


Accompanying the oral request to be made by the instructor (who 
knew the real purpose of the request) was a check list on which each 
student could volunteer for the activity of solving problems alone, 
or the group “discussion.” A volunteer would be used at his con- 
venience during the following two weeks so that no student had to 
concern himself about schedule conflicts. Any student who did not 
wish to volunteer could so state by checking a third space. All stu- 
dents, regardless of their desire to volunteer or not, were asked to 
complete the form. 

The 168 students were classified into 42 self-oriented, 42 inter- 
action-oriented, 42 task-oriented, and 42 residual subjects. Those 
labeled s, i or t were in the upper quartile in score on the
appropriate s, i or t scale and below the median on the other two scales.
Those labeled “residual” failed to be in the upper quartile on any 
scale or in a few instances were so on two of the three Ori scales. 


Volunteering Tendencies 


Of the 168 subjects, 71.4 per cent of the 42 task-oriented subjects 
volunteered for one or the other research activities; 61.9 per cent of 
the 42 self-oriented subjects volunteered; and 57.1 per cent of the 
42 interaction-oriented subjects volunteered. 

Appropriate chi square analyses indicated that task-oriented sub- 
jects were significantly (at the 1 per cent level) more likely to vol- 
unteer than subjects in general, including the residual subjects who 
fell in between at 69.0 per cent in their tendency to volunteer. Self- 
oriented and interaction-oriented subjects did not differ significantly 
from each other or from the general volunteering tendency of 64.9
per cent for all 168 subjects combined.
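
The reported percentages can be checked with a little arithmetic. The sketch below converts each group's percentage back into a head count (each group has 42 members) and recomputes the overall volunteering rate; the variable names are illustrative only.

```python
GROUP_SIZE = 42

# Percentage of each group that volunteered, as reported in the text.
reported_pct = {"task": 71.4, "self": 61.9, "interaction": 57.1, "residual": 69.0}

# Convert percentages back to whole numbers of volunteers.
counts = {g: round(p / 100 * GROUP_SIZE) for g, p in reported_pct.items()}
total_volunteers = sum(counts.values())
total_subjects = GROUP_SIZE * len(reported_pct)
overall_pct = 100 * total_volunteers / total_subjects

print(counts)                   # {'task': 30, 'self': 26, 'interaction': 24, 'residual': 29}
print(total_volunteers)         # 109
print(round(overall_pct, 1))    # 64.9
```

The 109 volunteers out of 168 leave 59 non-volunteers, which agrees with the number of original non-volunteers given later in the report.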

The greater tendency for task-oriented subjects to volunteer, with
no extrinsic rewards for so doing, seemed consistent with the
intended meaning of task-orientation. Volunteering could be prompted,
in this case, by a desire to be of service to science, by curiosity, by the
desire to learn more through a novel experience, and as well by a
desire to please one's instructor. But this last matter was probably
not of great weight, for the instructors were graduate student quiz
leaders meeting with the students only once out of three class hours
a week. Grades in the course are primarily based on departmental
objective exams.

Aside from providing an indication of the validity of the
discrimination between task-oriented and other types of students, these
results point to a significant experimental issue. Volunteers for
psychological experiments with no extrinsic rewards are not an
accidental sample, but clearly a biased one in favor of task-oriented
students.

The greater willingness to work demonstrated here by the task- 
oriented subjects was also shown in another simple unpublished 
comparison by Mr. George Foreman. In line with the meaning of 
task-orientation, ten college students who had been lifting weights 
at a salon for six months or more earned a task-orientation score of 
36.0 in comparison to the normative mean of 31.5 for 233 men at the 
same college, a statistically significant difference at the 2 per cent 
level of confidence. No significant differences were found on the 
other two scales. 

These results are also consistent with a college achievement study 
showing that when students are matched on intelligence test scores, 
those students with higher task-orientation scores earn higher grade 
point averages in college.³


Discussion Versus Problems Alone 


Of those who volunteered, 54.2 per cent of the interaction- 
oriented subjects volunteered for “discussion” rather than “solving 
problems alone”; while only 50.0 per cent of the task-oriented, and 
34.6 per cent of the self-oriented subjects volunteered for “discus- 
sion” rather than “solving problems alone.” 

Although interaction-oriented subjects chose discussions to a 
greater degree than did all other subjects, the appropriate chi square 
failed to attain statistical significance. 

The equal choice by task-oriented volunteers of discussion and
working alone was the first of many evidences to be reported
subsequently that the “task-oriented” are in no way anti-social or
anti-group; in fact, it will be shown that they are likely to emerge as
the heroes of group action. Results indicate that, when forced to
choose, it is the self-oriented subjects (and significantly so) who
shy away from groups. We infer that the task-oriented subject sees
the group with as much challenge and interest as he does a problem
requiring isolated work (see also Bass & Dunteman, in press).

3 See footnote 2.


Effects of Added Extrinsic Incentive 


A week later, the classes were notified that more volunteers were 
needed so that now all volunteers, including those who already had 
volunteered as well as those who now wished to do so, would be paid 
$1.50 for their 50 minute effort. (The $1.50 was fixed by determin- 
ing what 50 per cent of an independent sample of students said 
would be satisfactory for pay for subjects of a psychological experi- 
ment.) 

Of those who had refused to volunteer without incentive, 57.1 
per cent of the self-oriented, 44.4 per cent of the task-oriented, and 
36.4 per cent of the interaction-oriented subjects now volunteered. 

Again, results were consistent with expectation although they 
failed to attain statistical significance, partly because there had 
been only 59 original non-volunteers. Of these 59, 20 or 47.6 per cent 
now volunteered with the added monetary incentive. Yet of the 
original self-oriented subjects, 57.1 per cent now were moved to vol- 
unteer. Consistent with the self-concerns and "what's in it for me?” 
implied in self-orientation, the self-oriented person was most 
prompted to volunteer when his behavior would result in immediate, 
direct rewards to himself. Here the behavioral change of self-ori- 
ented subjects in greatest proportions might be described as illus- 
trating that they, of all types, were most often out for immediate, 
tangible gain for themselves. They were relatively less influenced by 
appeals to be of service to science, curiosity, and other factors which 
had served to motivate 7 out of 10 task-oriented subjects to volun- 
teer. 

Of those who now volunteered when monetary incentives were 
included in the call, every one of the 5 interaction-oriented subjects 
chose “discussion” rather than “work alone”! The 14 self-oriented
and task-oriented subjects who now shifted from non-volunteers to 
volunteers chose “working alone” or “discussion” about equally. 




Orientation Scores Associated with Related Personality Assessments 


A battery of personality inventories, attitude questionnaires, and 
an intelligence test was administered to 110 students in several 
courses in industrial psychology along with 18 of the 27 triads of the 
Orientation Inventory. Table 2 presents the correlations between 
these various personality assessments and the three scale scores 
which were significant at least at the 5 per cent level of confidence. 
The large number of significant correlations were much more than 
might have been expected on any chance basis. 

The correlations also tended to be meaningful and consistent with
the purposes and intent of Ori. Thus, the highly self-oriented
individual, according to Ori, tended to describe himself on the several
inventories and questionnaires listed at the bottom of Table 2
(developed by Guilford & Martin, Rokeach, Cattell, et al., Taylor, Bass,
Edwards, Lentz and Wonderlic & Hovland) as disagreeable, dogmatic,
aggressive-competitive, sensitive-effeminate, introvertive,
suspicious-jealous, tense-excitable, manifestly anxious, lacking in
control, immature-unstable, needing aggression, needing
heterosexuality, lacking in need for change, fearing failure and
feeling insecure.

In contrast, the interaction-oriented examinee described himself 
significantly in need of affiliation, socially group dependent, lacking 
in need for achievement, lacking in need for autonomy, needing
nurturance, tending to warmth and sociability and lacking in need for
aggression.

In still further contrast, the individual identified as task-oriented
on Ori described himself on the other eight inventories and
questionnaires as self-sufficient and resourceful, controlled in will
power, needing endurance, aloof and not sociable, sober and excitable,
introvertive, radical, not dogmatic, lacking in need for
heterosexuality, needing abasement, aggressive and competitive,
lacking in need for succorance, not in fear of failure, and mature and
calm. In addition, only the task-oriented scale correlated positively,
and significantly so, with the Wonderlic test of intelligence.


Orientation Scores Associated with Related Personal Factors 


Further systematic relations were expected and found pointing to
occupational, educational, and maturational factors contributing
to orientation. To accomplish these analyses, Ori was administered
to various groups of industrial management, college students,
housewives and spouses, and others.


TABLE 2
Significant Correlations Between Orientation and ... Personality Inventories

[Table body not legible in this copy. Columns: product-moment
correlations (r) with Self-Orientation, Interaction-Orientation, and
Task-Orientation, with the source of each test. Table note: p < .05
when r = .19 with 108 df.]


Technical Versus Non-Technical Education


Technical college training seems to contribute significantly to
enhancing task-orientation scores. Thus, 19 oil refinery supervisors with
technical education earned a mean task-orientation score of 36.3,
significantly greater at the 1 per cent level than the mean of 30.1
found for supervisors without college degrees in engineering in the
same company. Conversely, the technical graduates were significantly
lower (X = 20.7) in interaction-orientation than the non-college
trained supervisors (X = 28.4).

Similar results were obtained in a chemical plant for 74 graduate 
engineers who earned a mean of 19.9 in interaction-orientation and 
a mean of 38.6 in task-orientation. These scores can be contrasted
with those for 58 men living in the same suburb, aged 34 to 52 in 
varying occupations, who earned a mean interaction score of 25.0 
and a mean task score of 33.0. 

No differences between the three samples were found in self-ori- 
entation, the mean for all samples remaining between 22.4 and 22.7. 

The significance of engineering training is further supported by 
an independent analysis of Purdue University students by A. Mar- 
ston who found engineering students to earn significantly higher 
task scores than the mean for all students tested (Marston & Levine, 
in press). 


Current Occupation 


When the 74 engineering graduates were classified according to
their jobs, significant differences in meaningful directions were
observed. Thus the 35 technical graduates now engaged directly in
research and engineering earned a mean task score of 39.6,
significantly greater at the 1 per cent level than the 39 technical
graduates now engaged in supervision, administration, or area people
contact. Conversely, the latter earned a higher (p < .05) interaction
mean (20.7) than the 35 men working currently in engineering or
research, whose interaction mean was 19.0. Self-orientation scores were
practically identical (22.4 and 22.7) (Dunteman & Bass, in press).

A comparable analysis for 25 professional women secretaries
again yielded meaningful results. When the 25 were contrasted with
25 women of the same age (mean age of 36) who were currently
attending day or night school college classes, secretaries earned
significantly higher (X = 27.6) interaction-orientation scores than the
matched students (X = 23.6). On the contrary, secretaries earned
significantly lower self scores (X = 19.7) than students (X = 23.6),
while both secretaries and women students of the same age
earned higher task scores than 84 housewives with adolescent
children aged 32 to 56. The task scores of the three samples were:


25 secretaries (mean age 36) 33.7 
25 women attending classes (mean age 36) 34.1 
84 housewives (age 34 to 52) 32.2 


Age-Sex Differences


To study the effects of same or different orientation on marital
compatibility of spouses, as well as the contribution of parental
orientation to their children's orientation scores, 98 adolescents and
their 142 parents were administered Ori. Analyses of these effects
will be reported elsewhere. However, the sample differences between
parents (aged 32 to 56) and their children (aged 12 and 13) as well
as sex differences are made available for examination here. Table 3
shows the distributions obtained for the sample, by age and sex.

As can be seen in Table 3, at all age levels girls and women are
more interaction-oriented and less task-oriented than boys and men.

At the same time, there are systematic increases for both sexes
in task-orientation with increased age and education, coupled with
reduced interaction-orientation.


TABLE 3
Orientation of Parents and Their Early Adolescent Children

Mean Orientation

                     Self    Interaction    Task
Boys (N = 50)        23.0    29.0           28.0
Girls (N = 48)       21.0    30.6           29.2
Fathers (N = 58)     22.7    25.0           33.0
Mothers (N = 84)     21.6    27.0           32.2

Standard Deviations in Orientation

                     Self    Interaction    Task
Boys (N = 50)        4.8     5.1            5.4
Girls (N = 48)       5.1     5.4            [illegible]
Fathers (N = 58)     5.6     5.7            5.4
Mothers (N = 84)     5.6     [illegible]    5.8

All the differences between samples in task and
interaction-orientation are significant.


College Achievement 


Given the ability, task-oriented subjects are overachievers in 
comparison to those of equally high ability earning lower scores in 
task-orientation.⁴ Among 95 college students, those in the top
quartile in task-orientation and below the median in interaction and 
self-orientation earned a grade-point average of 2.41 (A = 4.0) 
while the remaining subjects earned a grade-point average of 2.27. 
Since intelligence was correlated with task-orientation, grade-point 
average of task-oriented subjects with IQ's over 110 were com- 
pared with other subjects of comparable IQ's, and grade-point aver- 
ages of task-oriented subjects of IQ's below 110 were compared in 
corresponding fashion as shown in Table 4. 

Task-orientation makes no difference among subjects below 110 
in IQ; but above 110, subjects who are task-oriented earn higher 
grade-point averages (X = 2.68) than those who are not
task-oriented (X = 2.44). Both task-orientation and IQ correlate
positively with grade-point average as well as with each other; but their
combined correlation with grade-point average undoubtedly is 
greater than the simple sum of their covariances with grade-point 
average. (Subsequent larger sample analyses of the multiple cor- 
relation are now in progress.) 


TABLE 4

Grade-Point Average of 95 College Students
as a Function of IQ and Task-Orientation

                Non-Task-Oriented    Task-Oriented    All
IQ above 110    2.44                 2.68             2.56
IQ below 110    2.10                 2.10             2.10
All             2.27                 2.41             2.33

4 This is an unpublished analysis by Roland Frye, Donald South, and Joel
Butler.


Summary

This article describes the development and reliability of the
Orientation Inventory (Ori), an inventory to assess self, interaction
and task-orientation. In its fourth revision based on internal
consistency analyses and relevant evaluation, Ori consists of 27 triads
scored in the same manner as the Kuder Preference Record, requiring
the examinee merely to choose the one alternative in each triad
he prefers most and the one he prefers least. The inventory contains
questions about the value of groups, personal needs, and value
projections.

The test-retest reliabilities for the three scales are as follows: 
self-orientation, .73; interaction-orientation, .76; and task-orienta- 
tion, .75. These results are based on 84 college students taking two 
administrations of the current edition a week apart. 

Overt decisions to volunteer for various kinds of “research ac- 
tivities” and to continue working after interruption were found as- 
sociated with Ori scores. 

Task-oriented subjects were significantly more likely to volunteer
in the absence of any extrinsic appeals, while more non-volunteering
self-oriented subjects shifted to volunteers when offered pay
for their time. After volunteering, interaction-oriented subjects were 
most prone to prefer a discussion over a problem to be solved alone, 
while self-oriented subjects preferred working alone. These overt 
behaviors were consistent with the meanings of self, interaction, and 
task-orientation. However, the fact that task-oriented subjects 
chose discussion and working alone equally often suggested a
need to clarify the meaning of task-orientation. The task-oriented 
member is not particularly averse to interacting with others; rather, 
it is the self-oriented member who is most likely to exhibit this 
tendency. 

Self-orientation scores were higher among dominant-aggressive 
subjects with various neurotic characteristics. Self-orientation was 
significantly less likely among professional women secretaries than 
women of the same age currently attending college. 

Interaction-orientation scores were associated with needs for 
affiliation and dependency, decreasing with age. Interaction scores 
were higher among supervisors and secretaries than among matched 
non-supervisory engineers or matched women. Interaction-orienta- 
tion was higher for women at all age levels, although significantly 
lower for parents than for their adolescent children. 

Task-orientation was higher among parents than their adolescent
children, among engineers and engineering students than comparable
examinees, among those of higher intelligence and among persons
describing themselves as stronger personalities: self-sufficient,
sober, mature, independent, etc.

Task-oriented students are higher in college achievement than 
equally highly intelligent classmates of lower task-orientation. 
However, task-orientation apparently has no effect when students 
average lower than 110 in IQ. 


REFERENCES 


Bass, B. M. “Famous Sayings Test: General Manual.” Psychological Reports Monograph, IV (1958), 479-497.

Bass, B. M. Leadership, Psychology and Organizational Behavior. New York: Harper and Brothers, 1960.

Bass, B. M. The Orientation Inventory. Palo Alto: Consulting Psychologists, Inc., 1962.

Bass, B. M. and Dunteman, G. “Behavior in Groups as a Function of Self, Interaction, and Task Orientation.” Journal of Abnormal and Social Psychology, in press.

Bernberg, R. E. Human Relations Inventory. Chicago: Psychometric Affiliates, 1959.

Cattell, R. B., Saunders, D. R., and Stice, G. The Sixteen Personality Factor Questionnaire. Champaign, Illinois: Institute for Personality and Ability Testing, 1957.

Dunteman, G. and Bass, B. M. “Supervisory and Engineering Assignment Associated with Self, Interaction, and Task Orientation.” Personnel Psychology, in press.

Edwards, A. L. The Personal Preference Schedule. New York: Psychological Corporation, 1954.

Fouriezos, N. T., Hutt, M. L., and Guetzkow, H. “Measurement of Self-Oriented Needs in Discussion Groups.” Journal of Abnormal and Social Psychology, XLV (1950), 682-690.

French, E. G. “Effects of the Interaction of Feedback and Motivation on Task Performance.” American Psychologist, XI (1956), 395 (Abstract).

Frye, R. L. and Osborn, M. “The Effect of Orientation Type on Task Completion of Elementary Grade Students.” Psychological Reports, in press.

Guilford, J. P. and Martin, N. Guilford-Martin Personnel Inventory. Beverly Hills: Sheridan Supply Company, 1943.

Lentz, T. F. The Conservatism-Radicalism Questionnaire. St. Louis: Washington, Character Research Assn., 1946.

Marston, A. R. and Levine, E. M. “Interaction Patterns in a College Population.” Journal of Social Psychology, in press.

McClelland, D. C., Atkinson, J. W., et al. The Achievement Motive. New York: Appleton-Century-Crofts, 1953.

Rokeach, M. The Open and Closed Mind. New York: Basic Books, 1960.

Taylor, J. A. “A Personality Scale of Manifest Anxiety.” Journal of Abnormal and Social Psychology, XLVII (1953), 285-290.

Tuddenham, R. D., MacBride, P., and Zahn, V. “The Sex Composition of the Group as a Determinant of Yielding to a Distorted Norm.” Technical Report 4, Contract NR 170-159, University of California, Berkeley, 1958.

Wonderlic, E. F. and Hovland, C. I. “The Personnel Test.” Journal of Applied Psychology, XXIII (1939), 685-702.




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


SECOND-ORDER FACTORS IN QUESTIONNAIRE DATA¹


JOHN HORN 
University of Denver 


The 1957 edition of the Sixteen Personality Factor Questionnaire
(16 P.F.) has been factored, either by itself or along with other 
variables, by Becker (1961), Cattell (1957), Cattell and Scheier 
(1959), Karson (1961), Karson and Pool (1958), Scheier and Cat- 
tell (1958), and others who have not published their results. In these 
studies two factors—usually the first two in order of variance con- 
tribution or size of latent root—appear with only rather minor vari- 
ations of pattern despite noteworthy differences in the composition 
of the samples of persons and variables, and despite procedural vari- 
ations such as those which result from investigators’ having different 
rotational philosophies, using different criteria (on samples of
different size) to decide the number of factors to extract or interpret, etc.
The reliable variance of the 16 P.F. variables is seemingly not ex- 
hausted by these first two factors, however. When the 16 P.F. is 
factored alone, for example, application of tests for determining the 
“correct” number of factors usually indicates more than two factors 
(e.g., Karson, 1961); in studies employing a wider range of variables,
the questionnaire variables usually help to define more than two 
factors (e.g., Cattell & Scheier, 1958). But it has been difficult to 
regard the factors beyond the first two as replicated in different 
studies. Samples have not been described in sufficient detail to allow 
development of hypotheses that would, if verified, explain the differ- 
ences in factor structure in terms of known relationships between 


1 This investigation was carried out during the tenure of a predoctoral fel- 
lowship from the National Institute of Mental Health, United States Public 
Health Service. 




demographic and personality variables. But, because procedures in
the various studies have differed, it is reasonable to suppose that the
variation in reported results is at least partly due to these
differences. The present study attempts to help answer this question
for a particular variation on procedure, viz., variation on rotational
procedure. At the same time the study presents some new evidence on
the nature of the second-order factors in this widely used test.
There has been much debate during the last thirty years about 
the correct or best rotational procedure to follow in factor analysis. 
In general, American and British investigators have taken opposite 
sides in arguing for or against simple structure as opposed to a
bi-factor or hierarchical solution, and American investigators have
differed among themselves on the question of whether good simple
structure and good opportunity for replication can be achieved when
factors are required to be orthogonal. It would seem, also, that some
difference of opinion exists (though it is rarely expressed) concerning
the relative virtues of analytic or machine rotational methods as
compared with the method of visual rotation.
Bernyer (1958) and Eysenck (1953) have recently argued that
the differences between results obtained by British and American
workers do (or would) dissolve if the latter permitted factors to
become oblique and obtained a proper second-order solution. Their own
results and those of others (e.g., Burt’s (1960) analysis as compared
with that of Jackson (1960) on the same data) do tend to support
this contention, although the results in detail are by no means
always “matchable.”
Cattell (1952, 1957) has contended that good simple structure is
not possible in orthogonal solutions except in a few cases (often
artificially constructed) and that good (oblique) simple structure
must be attained if results are to be replicable. Coan (1959) has
reported a comparison of oblique and orthogonal solutions obtained
on the same data. His results suggest that, if one ignores the absolute
size of loadings, which is essentially arbitrary, and considers only
their rank order, the factors obtained with the two methods are in
some cases identical or clearly very similar, but in other cases are
not clearly matched.
Most investigators would seem to agree that analytic procedures
for obtaining simple structure are most desirable (cf. Ferguson,
1954). But many would contend that the existing analytic procedures
for obtaining oblique simple structure (e.g., Pinzka & Saund-
ers, 1954; Carroll, 1957; Kaiser & Dickman, 1959; Cattell & Muerle, 
1960) are not adequate, and that visual rotation is therefore still 
necessary. The latter is time-consuming; it is an art that many have 
not mastered; although it can be as objective as an analytic method, 
it probably often is not. If it is conceded that under optimal condi- 
tions visual rotation will usually yield the best simple structure, the 
practical question still remains of whether the results obtained by 
this method are sufficiently different from those obtained by analytic 
methods to justify the extra effort and time that is required. 

The present study will not provide conclusive answers to any of 
these general methodological questions. It really aims only to help 
provide a better understanding of the structure inherent in a par- 
ticular set of variables—the sixteen personality factors in question- 
naire media. But in seeking this, the above general methodological 
issues had to be faced. Results were obtained in a way that would 
cast light on some of the questions raised. Comparison of rotational 
solutions here reported should prove helpful to workers contemplat- 
ing use of one or the other, but not all, of the available methods, 
and in reviews of the results obtained by use of these different proce- 
dures. 


Procedures 


All factors except B (intelligence measured by verbal analogies) 
were represented in the study by their A forms in the 16 P.F. In 
addition, the B forms of C, Q3, and G were entered separately in
order to increase the relative variance contribution of 16 P.F. second- 
order factors beyond the first two, for these variables have entered 
prominently into one or more of the later factors found in most 
previous studies. 

Besides the 16 P.F. variables, two “new” questionnaire variables 
were constructed, also with the intention of increasing the relative 
variance contribution of second-order factors beyond the first two. 
One of these aimed to measure the “confidence” which, it was 
thought, would stem from the “unbroken success” that Cattell 
(1957) has said characterizes a third factor appearing with some 
consistency in separate studies. This “new” variable was found to 
have a KR-20 internal consistency of .59. The other “new” variable 
was designed to measure the “lack of conformity” which Karson 


120 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


(1961), for example, has said characterizes a fourth second-order 
factor sometimes found among the 16 P.F. variables. This question- 
naire was found to have a KR-20 internal consistency of .63. 
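
The KR-20 coefficients quoted for the two "new" variables can be illustrated with a minimal sketch of the Kuder-Richardson formula 20 for dichotomously scored items. The function name kr20, the use of the population variance of the total scores, and the small data set are assumptions made here for illustration; none of them is taken from the study.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for 0/1 item scores.

    responses: list of persons, each a list of 0/1 item scores.
    Assumption: the variance of total scores is the population
    variance (divisor N), as in the classical statement of KR-20.
    """
    n_persons = len(responses)
    n_items = len(responses[0])

    # Sum of p*q over items, where p is the proportion scoring 1.
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(person[i] for person in responses) / n_persons
        sum_pq += p * (1.0 - p)

    # Variance of the total scores across persons.
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons

    return (n_items / (n_items - 1)) * (1.0 - sum_pq / var_total)


# Hypothetical data: four persons, three items.
example = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(kr20(example), 2))
```

A coefficient near .6, as reported for the two new scales, would indicate only moderate homogeneity of the items under this formula.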

The questionnaires were administered to 172 men; 33 were under- 
graduate students aged 18 to 33; 65 were air force enlisted men 
aged 18 to 26; 74 were state penitentiary convicts aged 19 to 38. 

The product-moment correlations between the 20 variables are 
given above the diagonal in Table 1.²

To decide the number of factors, unities were first inserted in the 
diagonals and centroid factors were estimated for m/2, or 10, fac- 
tors. The centroid sum of squares (analogous to the latent root) 
became less than unity on the sixth factor. The number of factors
was therefore estimated at five, following the proofs of Guttman
(1954), who shows that the lower bound for the rank of a matrix
under these conditions is that just previous to where the latent root
becomes less than unity; the proofs of Kaiser (1960), who shows that
the alpha internal consistency for the factor becomes zero or negative
at this same point; and the arguments of Dickman (1960), who contends
that it is unreasonable to rotate and try to interpret factors which
have less than the arbitrary unit variances assigned to tests.
Communalities were then stabilized through eleven iterations of the
centroid process. The estimated communalities and the final centroids
are given in Table 2. The residuals after the five factors had been
extracted using these estimated communalities (and not again
re-estimating after each factor extraction) are given in the lower
part of Table 1, including the principal diagonal. At this point 99.4
per cent of the common variance implied by the estimated communalities
had been extracted. Applying the combination of tests which Sokal
(1959) found most dependable, the residuals were found to be
insignificant.
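
The "roots greater than unity" rule invoked here can be sketched as follows. The sketch departs from the procedure of the study in one stated respect: it computes latent roots of the correlation matrix (with unities in the diagonal) by power iteration with deflation, rather than centroid sums of squares. The function names and the four-variable correlation matrix are hypothetical.

```python
import math


def latent_roots(r, iterations=1000):
    """Latent roots of a symmetric correlation matrix, largest first.

    Power iteration with deflation; adequate for small illustrative
    matrices, not a substitute for a library eigenvalue routine.
    """
    n = len(r)
    a = [list(row) for row in r]
    roots = []
    for k in range(n):
        # A different starting vector is used for each root.
        v = [1.0 + ((i + k) % n) for i in range(n)]
        for _ in range(iterations):
            w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to extract
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        root = sum(v[i] * av[i] for i in range(n))  # Rayleigh quotient
        roots.append(root)
        # Deflate: remove the variance accounted for by this root.
        for i in range(n):
            for j in range(n):
                a[i][j] -= root * v[i] * v[j]
    return roots


def number_of_factors(r):
    """Count the latent roots that are not less than unity."""
    return sum(1 for root in latent_roots(r) if root >= 1.0)


# Hypothetical correlations among four variables (unities in the diagonal).
R = [[1.0, 0.6, 0.0, 0.0],
     [0.6, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.2],
     [0.0, 0.0, 0.2, 1.0]]
print(number_of_factors(R))
```

In the study the corresponding quantity fell below unity on the sixth centroid factor, which is why five factors were retained.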

Since investigators occasionally choose not to rotate the centroids 
(e.g., Osgood, 1957), the factors presented in Table 2 can be regarded 
as one kind of “final” solution. Numbers indicating the relative vari-
ance contributions (proportion of total variance) of the factors are
listed in the row labeled “Var.” In this and all other factor tables,


2 Although results throughout this report are given to only two decimal places 
(omitting decimal points), actual calculations in all but the graphic rotation 
process (which used two places) were carried to four or more places in order 
to minimize cumulative rounding errors. 


TABLE 1
Correlations Between Variables (Above the Diagonal), Residuals (In
and Below the Diagonal), and Communalities

[This table was printed sideways and its entries are not legible in
this copy. The variables, in order, are: 1. A; 2. E; 3. F; 4. H;
5. Q2; 6. N; 7. Q1; 8. I; 9. M; 10. L; 11. Q4; 12. O; 13. C (A);
14. C (B); 15. Q3 (A); 16. Q3 (B); 17. G (A); 18. G (B);
19. New Self-Confidence; 20. New Independence.]




TABLE 2
The Centroid Solution

Variables    Ic    IIc   IIIc   IVc    Vc    Est. h²   Sum h²
A            14     28   -32    -29    13     311       301
E           -12     24    33    -58    37      62       654
F            09    -15    34    -33    19     301       291
H            42     15   -05    -34    23     303       370
Q2          -41     22    25     41   -24     496       505
N            16    -10    16     15    03     089       085
Q1          -22     34    38    -08    01    3290       318
I           -26     42   -28     23   -27     442       448
M           -47     16    13    -10   -19     319       310
L           -47    -29    21    -17    11     390       390
Q4          -61    -38    20    -14    11     587       588
O           -53    -44   -17    -06   -0      517       5%
Ca           49     06    31     05   -05     349       345
Cb           55     28    03     08   -05     389       391
Q3a          51    -03    03    -15   -29     369       368
Q3b          54     22   -01     18   -03     376       378
Ga           48    -16    24    -13   -24     392       388
Gb           10    -34    39     07    16     313       308
Conf.        51     34    28      3   -10     407       408
Indep.      -15     28    35     18    16     302       281
Var.        324    145   128    110   065
Hyp.          2      3     4      7     7      23


the number of variables at or within a ±.10 hyperplane are given
in a row labeled “Hyp.” The last entry in this row is the total hyper-
plane count across all factors.
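
The hyperplane count is simple enough to state as a short routine. The following sketch, with a hypothetical function name, counts the loadings at or within the ±.10 hyperplane. Loadings are entered as printed in the tables (in hundredths), so the width is 10. The illustrative column is the first Binormamin factor of Table 5.

```python
def hyperplane_count(loadings, width=10):
    """Number of loadings at or within the +/- width hyperplane.

    Loadings are taken as printed in the tables, i.e. in hundredths
    with decimal points omitted, so width=10 means .10.
    """
    return sum(1 for a in loadings if abs(a) <= width)


# Column I of the Binormamin solution (Table 5), as printed.
binormamin_I = [-20, 21, 24, -20, 0, -7, 4, -1, 28, 54,
                66, 57, -25, -49, -13, -52, 0, 18, -38, -13]
print(hyperplane_count(binormamin_I))
```

The total in the last entry of a "Hyp." row is the sum of such counts over the columns of the table.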

Rotational solutions are presented in Tables 3 through 8. In all
of these, factors have been reflected and re-arranged to facilitate
comparisons and to correspond with results reported in earlier
studies. The h² column gives the sums of squares across the orthogonal
factors finally obtained (except in the centroid table where the
communalities estimated just previous to the final computation, the
“entered-with” estimated communalities, have also been given).
These should, of course, be identical. That they are not indicates the
extent of the rounding errors that can occur even when four and
more places are carried in the computations.
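
As a worked illustration of the h² column, the communality of variable j over m orthogonal factors is the sum of its squared loadings. The symbols a_{jk} and m are notation introduced here, and the example uses the two-place centroid loadings printed for variable A in Table 2, so it agrees with the tabled value only to rounding.

```latex
h_j^2 = \sum_{k=1}^{m} a_{jk}^2 ,
\qquad
h_A^2 = (.14)^2 + (.28)^2 + (-.32)^2 + (-.29)^2 + (.13)^2
      = .0196 + .0784 + .1024 + .0841 + .0169 = .3014 .
```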

In Table 3 are the results obtained from application of the Varimax
criterion (Kaiser, 1958). This is an orthogonal solution. Here,
and in Tables 6 and 7, the numbers indicating the relative variance
contributions (row “Var.”) are for the rotated factors.
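
For readers unfamiliar with the Varimax criterion, the following is a minimal sketch of a raw (unnormalized) Varimax rotation carried out by successive planar rotations of pairs of factors. It is an illustration under stated assumptions and not the program used in this study: whether the present solution used the normalized form is not stated, the function names are hypothetical, and the small loading matrix is invented.

```python
import math


def varimax_criterion(a):
    """Raw Varimax criterion: summed variance of squared loadings by column."""
    n = len(a)
    total = 0.0
    for k in range(len(a[0])):
        squares = [row[k] ** 2 for row in a]
        total += sum(s * s for s in squares) - sum(squares) ** 2 / n
    return total


def varimax(loadings, tol=1e-10, max_sweeps=100):
    """Rotate an orthogonal loading matrix toward the raw Varimax maximum."""
    a = [list(row) for row in loadings]
    n, m = len(a), len(a[0])
    for _ in range(max_sweeps):
        largest_angle = 0.0
        for p in range(m - 1):
            for q in range(p + 1, m):
                u = [row[p] ** 2 - row[q] ** 2 for row in a]
                v = [2.0 * row[p] * row[q] for row in a]
                sum_u, sum_v = sum(u), sum(v)
                c_term = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
                d_term = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
                # Angle that maximizes the criterion in the plane of p and q.
                angle = 0.25 * math.atan2(
                    d_term - 2.0 * sum_u * sum_v / n,
                    c_term - (sum_u ** 2 - sum_v ** 2) / n,
                )
                largest_angle = max(largest_angle, abs(angle))
                cos_a, sin_a = math.cos(angle), math.sin(angle)
                for row in a:
                    x, y = row[p], row[q]
                    row[p] = x * cos_a + y * sin_a
                    row[q] = -x * sin_a + y * cos_a
        if largest_angle < tol:
            break
    return a


# Invented unrotated loadings for five variables on two factors.
unrotated = [[0.7, 0.3], [0.6, 0.4], [0.5, -0.5], [0.2, -0.6], [0.3, 0.1]]
rotated = varimax(unrotated)
print(varimax_criterion(unrotated) <= varimax_criterion(rotated))
```

Because each planar rotation is orthogonal, the communalities of the variables are unchanged by the rotation; only the distribution of variance over the factors is altered.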

In Tables 4 and 5 are listed the reference axes³ that result from


3 By and large most investigators, both in the past and at the present time 




TABLE 3
The Varimax Solution

Variables    Iv    IIv   IIIv   IVv    Vv     h²
A           -20     28   -17     38   -11    304
E            12     74    31     08   -03    659
F            10     44    06    -24    16    292
H           -32     49   -19     08    05    386
Q2           10    -45    54     01   -07    504
N           -11    -04    01    -26    05    085
Q1          -02     15    54     06    01    324
I           -04    -03    14     65    02    448
M            32    -06    38     24    04    312
L            58     12    14    -13   -03    386
Q4           73     07    15    -14   -06    587
O            67    -18   -15     05   -03    517
Ca          -43     10    07    -28    27    346
Cb          -60     04   -03    -02    13    386
Q3a         -34     05   -20    -02    45    367
Q3b         -60    -04   -08    -07    07    373
Ga          -25     10   -10    -23    50    391
Gb           09     10    04    -53    10    310
Conf.       -57     16    18    -06    27    462
Indep.      -08     03    48    -13   -19    287
Var.        304    140   130    124   073
Hyp.          6     11     7      9    12     45


application of the Oblimax (Pinzka & Saunders, 1954) and the Bi- 
normamin (Kaiser & Dickman, 1959) criteria, respectively. These 
are oblique solutions. The correlations between factors are given 
above the principal diagonals at the bottom of each table. These 
correlation matrices were themselves factored in a manner like 
that described above. The final residuals are again given in and be- 
low the main diagonal in the matrices at the bottom of Tables 4 and 
5. The two factors estimated in each case were rotated to the orthog- 
onal position given by application of the Varimax criterion, after 


report and interpret the reference axes structure rather than the factor pattern 
when an oblique solution is obtained. Debate continues over the question of 
what form of expression of results is “best” to interpret (e.g., see White, 1961), 
but this debate usually does not hinge on a comparison of the structure and 
pattern mentioned here, for these are proportional and will, provided cut-off 
points between “high” and “low” loadings are also adjusted proportionately, 
always lead to interpretation of exactly the same sets of variables. The con- 
troversy concerns the relative merits of the factor structure vs. the factor pat- 
tern, the reference axes structure vs. the reference axes pattern, and each of 
these vs. the matrix of beta weights for estimating the factors, or a unit-wise 
product of weights. In the present case, both because tradition seems to dictate 
it and because the reference axes elements are scaled in a manner more nearly 
like that for the orthogonal factor elements, the reference axes structure (but
the estimated correlations between factor scores) are reported. 




which a Schmid-Leiman (1957) transformation was applied, pro-
ducing the results reported in Tables 6 and 7. These again are
orthogonal solutions, but solutions which take some account of the
correlations that can be allowed between the factors in oblique
solutions. Such solutions are probably similar to those which might 
be obtained by application of Burt’s method (e.g., Burt, 1949; Hol- 
zinger & Harmon, 1938) and, in fact, Bernyer (1958) has used a 
similar transformation procedure (Thurstone, 1947) to obtain such 
a solution with ability tests. 
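
The logic of the Schmid-Leiman transformation can be sketched briefly. Given the loadings of the variables on the oblique first-order factors and the loadings of those factors on orthogonal second-order factors, each first-order column is multiplied by the square root of that factor's uniqueness at the second order, and the general-factor loadings are the product of the two loading matrices. The function name and the small matrices below are hypothetical, and the sketch works from a factor pattern, whereas the tables of this study report reference axes structure.

```python
import math


def schmid_leiman(first_order, second_order):
    """Orthogonalize an oblique solution in the Schmid-Leiman manner.

    first_order:  variables x first-order factors (pattern loadings).
    second_order: first-order factors x second-order factors, the
                  second-order factors being orthogonal.
    Returns (residualized first-order loadings, general-factor loadings).
    """
    k = len(second_order)
    m = len(second_order[0])
    # Square root of the part of each first-order factor left unexplained
    # by the second-order factors (fails if a communality exceeds unity).
    unique = [math.sqrt(1.0 - sum(x * x for x in row)) for row in second_order]
    residualized = [[row[f] * unique[f] for f in range(k)]
                    for row in first_order]
    general = [[sum(row[f] * second_order[f][g] for f in range(k))
                for g in range(m)]
               for row in first_order]
    return residualized, general


# Invented example: four variables, two oblique factors, one general factor.
pattern = [[0.8, 0.0], [0.7, 0.1], [0.0, 0.6], [0.1, 0.5]]
higher = [[0.6], [0.5]]
residualized, general = schmid_leiman(pattern, higher)
for res_row, gen_row in zip(residualized, general):
    print([round(x, 2) for x in res_row + gen_row])
```

The orthogonal loadings so obtained reproduce the same common-factor correlations among the variables as the oblique solution does, provided the second-order factors account exactly for the correlations between the first-order factors.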

Finally, in Table 8 are listed the factors obtained after successive 
visual oblique rotations by an expert in factor rotation⁴ who had no
knowledge of the nature of the variables. He was asked to provide 


TABLE 4
The Oblimax Solution

Variables    Io    IIo   IIIo   IVo    Vo
A           -19     25   -34     20   -10
E            08     80   -24     07   -06
F            17     40   -19    -13    12
H           -21     38   -43     00    01
Q2          -05    -24    67     11   -06
N           -08    -08    06    -20    01
Q1          -06     32    31     14   -02
I            00     08    10     60    09
M            23     10    32     32    08
L            49     19    02    -06    01
Q4           60     16    05    -08   -01
O            60    -16   -02     06    06
Ca          -26     02    06    -13    18
Cb          -47    -03   -01    -01    06
Q3a         -07    -09   -08     14    42
Q3b         -49    -12    00    -08    00
Ga           02    -03   -03     01    45
Gb           14     05    00    -39    04
Conf.       -41     13    11     06    18
Indep.      -23     17    31    -12   -24
Hyp.          7      8    11*    10    14*    50 (tot.)

Io           06    -05    17     09   -50
IIo         -04    -01    39    -23    19
IIIo         04     01   -01    -22   -06
IVo         -12     05   -05     05   -53
Vo           05    -01    01     07   -02


4 I thank Dr. A. B. Sweney for performing this chore. Good rotators, like
good psychoanalysts, are sometimes said to be born, not made. Whatever the
source of his talent, Dr. Sweney is acknowledged to be exceptionally adept at
the art of rotation.




TABLE 5
The Binormamin Solution

Variables    Ib    IIb   IIIb   IVb    Vb
A           -20     34   -06     25   -11
E            21     68    30     10   -05
F            24     37    05    -16    12
H           -20     49   -01     02    02
Q2           00    -50    42     10   -02
N           -07    -07    05    -22    03
Q1           04     07    51     15    02
I           -01    -01    10     63    08
M            28    -11    22     32    08
L            54     10   -07    -08   -04
Q4           66     06   -12    -10   -07
O            57    -14   -41     05   -02
Ca          -25     01    22    -17    24
Cb          -49     01    20    -01    12
Q3a         -13    -01   -08     10    44
Q3b         -52    -05    14    -08    06
Ga           00     00   -02    -06    47
Gb           18     04    01    -44    05
Conf.       -38     07    38     04    26
Indep.      -13    -02    48    -11   -18
Hyp.          5     13*    9     11*   12     50 (tot.)

Ib           01    -19   -35     04   -39
IIb         -04     01   -01    -06    29
IIIb        -01     04    01    -12   -05
IVb          03    -04   -03     02   -42
Vb           02     03   -02     03   -06


what he considered to be “the best possible simple structure obtain-
able in a reasonable amount of time.”


Discussion of Results 
1. Comparison of Rotational Solutions. 

The goodness of simple structure is indicated by the relative num- 
ber of variables with near zero loadings in a factor and for a vari- 
able. This is implicit in Thurstone's (1947) statement of the principle 
and in Bargmann's (1954) statistical criterion for goodness of rota-
tional solution.⁵ Here a count of the number of variables in a ±.10
hyperplane discloses that Sweney’s visual rotation provides the best


5 This latter requires that test vectors be extended in accordance with the 
test’s communality. Cattell and Meschieri (1960) have questioned this proce- 
dure on grounds that it allows tests with small communalities (often the result 
of low reliability) to produce large factor loadings and thereby distort the sub- 
sequent test of significance of the simple structure. An acceptable alternative 
to this criterion has not been worked out, however. 




TABLE 6
The Schmid-Leiman Solution Based On Oblimax

Variables    Iso    IIso    IIIso    IVso    Vso    VIso    VIIso


A —19 21530 2194 —06 . —05 ——20 
E 08 69. 21 07 —04 =1 36 
T 18 34 17 14 07 22 24 
H 22 32 . 1—38 01 01 26 —07 
9, -05 —21 58 П 07—049. —25 25 
N -08  —06 05  -21 07 16 07 
Q —06 27 27 15  -001  -07 37 
I 00 07 09 61 05.5525 12 
M 24 08 28 33 05  -25 20 
List 5 16 02  —06 oO —22 24 
9, 62 13 04 08  —01  -—32 26 
О. 62 -14 02 06 04 —34 -07 
ед —27 02 05. —13 11 48 11 

b —49  —03  -01  —01 04 37 —08 
за go -07 —07 15 24 50 —14 
b 50 -1 00  —09 00 33 1-1 

A —03 -—03 01 26 57 01 
14 05 00 —40 03 22 23 

Conf. Г 11 09 06 11 46 11 
Indep) — 15 а 534 9 — 11 34 
Var 1294 099 090 087 019 193 086 
Hyp 7 8 12 9 15 3 5 

TABLE 7
The Schmid-Leiman Solution Based On Binormamin

Variables    Isb    IIsb    IIIsb    IVsb    Vsb    VIsb    VIIsb
A 718 4435 05 25 —05` —09 -25 
Е 18 69 24 09  —02 02 28 
F 21» *f 38 04  —16 05 24 16 
H -17 —01 02 01 22 —22 
Q: 00. «— 33 10  —01 —16 35, 
N —00 —07 04  —22 01 16 02 
Qi 04 07 40 15 01 01 36 
I 0920 08 62 04  —23 —06 
M 24 |. —M 17 32 04  —19 29 
L 47 10 -0 -08 -02  —19 34 
Qi 57 06 -10 —0  —0з —28 40 
0 49  —14 -33 05 —01 —34 11 
C. —21 01 17 -17 11 48 —05 
Ch —43 01 16 —01 06 35 -23 
Qu -11  -01 06 10 20 49 -27 

» —45 —05 1111-68 03 Si 225 

ff 00 00 —02 -06 21 57 -12 
Gp 16 05 01  —43 02 23 19 
Conf.  —33 07 30 04 12 48 —07 
Indep. —11 —02 399 -11  —08 · —05 32 
Var. 158 130 081 093 013 178 117 
Hyp. 4 13 10 11 16 4 4 




TABLE 8
The Visually Rotated Solution

Variables    IR    IIR   IIIR   IVR    VR
A           -21     38   -06     28   -09
E           -04     65    49     07   -10
F            07     34    20    -18    10
H           -99     49    03     02    02
Q2           05    -53    34     06   -09
N           -08    -10    03    -23    02
Q1          -06     08    54     10   -05
I            07     07    10     63    08
M            27    -09    28     30    05
L            45     09    08    -08   -02
Q4           57     05    04    -08   -04
O            61    -08   -31     09    05
Ca          -29    -04    20    -21    19
Cb          -46    -01    10    -03    08
Q3a         -07     01   -07     08    43
Q3b         -48    -08    02    -10    03
Ga           00    -01    03    -09    46
Gb           07    -03    08    -46    04
Conf.       -42     03    34    -01    19
Indep.      293    -09    45    -15   -26
Hyp.          9     15    11     12    15     62
, 
im — * 
nm 02 — м 
Ша 27 —03 — y 
IVa —09 ES —09 — 
Vn —48 UE EB Е 


simple structure. This finding is in agreement with the results
obtained by Fruchter and Novak (1958). The Schmid-Leiman
transformation, providing for two additional, more general factors,
leaves a somewhat better simple structure on the first five factors
than is provided by application of the analytic criteria alone.
Although Binormamin and Oblimax give the same total hyperplane count
initially, the former results in a somewhat better solution when the
Schmid-Leiman transformation follows.

It will be noted that all of the loadings in Vsb and Vso are small.
Many investigators would not interpret variables with loadings no
larger than the highest in these columns, the implication being that
these are null vectors. In contrast, the visually rotated factor V has
loadings (.43 and .46) that would lead many investigators to an
attempt at interpretation.

Still considering only the first five factors, it will be noted that
the hyperplane count of the Varimax factors is lower for all but one
of the factors than that for the comparable factor obtained through
a Schmid-Leiman transformation.

As would be expected, simple structure is not good in the unrotated
centroid solution.
It is possible to identify the “same” factors in each solution
except that of the unrotated centroid. The factors have been numbered
in a way to show this. In addition the variables have been ranked in
Table 9 according to absolute size of loading (for loadings above
about .20). Here it will be observed that the highest: (a) nine
variables in factor VI, (b) six variables in factor I, (c) five
variables in factors II and IV, and (d) two variables in factor V are,
in general, the same for all solutions and that there is considerable
agreement in the rank orders of these salients within each of the
factors. Thus, provided the interpreter did not try to make sense out
of the differences in size of the salient loadings (thereby implying
that these differences were significant), and confined his
interpretation to just that number of variables where agreement across
solutions is good, the basis for interpretation of these five factors
would be virtually the same regardless of which rotational procedure
was used.


The solutions do not agree on the salients and the rank order of
salients in the factors numbered III and VII. Perhaps more factors
should have been extracted. Warburton (1954) has argued that complete
factoring with unity “communalities” will tend to eliminate the
kind of ambiguity occasioned here. Both the factor III obtained by
Oblimax and that obtained by Binormamin might then appear separately.


TABLE 9
Rankings of Factor Loadings

[The entries of this table are not legible in this copy.]


2. Interpretation of Factors.

vs.-introversion. There is no cause to elaborate interpretation of
these factors. The theories have been sufficiently




the sign of E is reversed, and, more noticeably, threctia and schizo-
thymia come prominently into the pattern. Interpretation would
thus emphasize self-sufficiency and independence that exists without 
dominance and extraversion. In considering the alternative rota- 
tional positions, one is reminded that revolutionaries of the past, 
while usually critical of their times and independent-minded, have 
been both obtrusive and inobtrusive in their modus operandi. More 
work needs to be done if these differences are to be represented in 
clearly separate and replicable factors. 


The pattern of factor IV does not clearly replicate previous
findings. It appears to reflect individual differences in a sensitive
casualness such as is sometimes said to characterize the artist or
actor. Whatever the occupational associations, the stereotype is
common enough to suggest that such a pattern should be found
repeatedly in questionnaire data. That it has not been found in other
studies may be due in part to the fact that these studies either used
samples containing females (high on both I and G) or samples that were
in other ways more homogeneous than the sample here employed.
Factor V is similar to that which Karson (1961) and Karson and
Pool (1958) have interpreted as “obsessive-compulsivity-vs.-sociopathic
deviance.” The present study provides no support for this
interpretive hypothesis, however, unless one chooses to suppose that
convicts should be less given to sociopathic deviance than
nonconvicts: the point biserial correlation between the categories
“in prison-vs.-not in prison” and Q3a is .22; i.e., the mean for
“controlledness” is significantly higher for the convicts. The
independence and Ga variables correlate .004 and .003, respectively,
with this sample dichotomy. The multiple correlation of the three
variables with the “in prison” criterion is .24. Thus, it would seem
that convicts are somewhat more “obsessive-compulsive” than
nonconvicts. In light of this evidence it would seem advisable to drop
the “sociopathic deviance” part of Karson's hypothesis.
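
The point biserial coefficient used in this comparison is the ordinary product-moment correlation between a dichotomy and a continuous score. A minimal sketch follows. The function name, the coding of the dichotomy (1 for one category, 0 for the other), and the few scores shown are assumptions made for illustration and are not data from this study.

```python
import math


def point_biserial(group, score):
    """Point biserial correlation between a 0/1 grouping and a score.

    Uses the population standard deviation of the scores, which makes
    the result identical to the product-moment correlation.
    """
    n = len(score)
    ones = [s for g, s in zip(group, score) if g == 1]
    zeros = [s for g, s in zip(group, score) if g == 0]
    p = len(ones) / n
    mean = sum(score) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in score) / n)
    mean_diff = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return mean_diff / sd * math.sqrt(p * (1.0 - p))


# Hypothetical scores for two persons in each category.
print(round(point_biserial([0, 0, 1, 1], [1, 2, 3, 4]), 3))
```

A positive coefficient indicates only that the category coded 1 has the higher mean score; its size depends on the proportions in the two categories as well as on the difference between the means.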
Factor VI may best represent individual differences in a fairly
general tendency people have to present themselves in a favorable
light, to maintain a positive self-image, etc. (cf. Edwards, 1957). In
this all male sample, and taking the interpretation for the 16 P.F.
variables which Cattell (1957) has given, the socially desirable
stereotype is apparently of a person who is conscientious and
persistent (G+), controlled and exact (Q3+), mature and calm (C+),
confident, not insecure or anxious (O-), nor tense and excitable
(Q4-), nor sensitive and effeminate (I-), nor, rather surprisingly,
self-sufficient and resourceful (Q2-). This person would also tend
to be enthusiastic and talkative (F+), adventurous (H+), not
Bohemian and unconcerned (M-), not suspecting and jealous
(L-), and perhaps sophisticated and polished (N+). The subjects
in this sample were apparently not inclined to agree that the “ideal”
man should be dominant and aggressive (E+), warm and sociable
(A+), experimenting and critical (Q1+), and independent minded.

Although the agreement in rank order of variables between the
two Schmid-Leiman solutions on factor VII is not good, the factor
interpretation can be very much the same for both solutions. They
suggest that there is a fairly consistent tendency for self-ratings of
“independent,” “self-sufficient,” “critical,” and “dominant” to also
connote tense excitableness, suspicion, jealousy, and cold aloofness.
If factor VI were interpreted as an evaluative general dimension in
line with Osgood’s (1957) findings, then it would seem reasonable to
interpret factor VII as an at least suggestive match with the
“potency” factor repeatedly found in the studies by Osgood and his
coworkers. In fact, however, the pattern seems more nearly an
incompletely resolved combination of Osgood’s “potency” and
“activity” dimensions. As has been pointed out by a reviewer, this
factor is “also very similar to the pattern of temperament correlates
reported for the Witkin, et al. (1954) ‘active, analytic vs. passive,
global’ dimension, also known as field dependence.”


Summary

Using a sample of 172 men, the second-order factorial structure
among the sixteen personality factors that have been repeatedly
found in questionnaire media was studied by means of comparisons
of results obtained with five commonly used rotational procedures.
It was found that the basis for interpretation of four factors would
be much the same regardless of which procedure was used. Alternative
rotations having equally good simple structure were found for
one factor. A solution which allowed more general factors to be
determined from the intercorrelations of rotated factors was found
to maintain the simple structure on the “nongeneral” factors, and
the additional factors provided by this method were found to be
interpretable.




Two factors were easily matched with similar factors found in
previous studies and there interpreted as dimensions of extraversion 
and anxiety. A factor V was matched with a previously found pat- 
tern that had been interpreted as measuring sociopathic deviance. 
Evidence was presented which would cast doubt on this interpreta- 
tion. A factor III contained self-rated independence and criticality. 
Alternative rotations suggested that self-sufficiency and withdrawal 
might characterize one dimension of independence, while obtrusive 
criticality might characterize another. The most general factor in 
these data apparently reflected an evaluative ordering of the Q-data 
traits. The other general factor was regarded as an incompletely 
resolved combination of Osgood’s “potency” and “activity” dimen- 
sions. Finally, a bohemian-sensitivity factor was found which ap- 
peared to represent a fairly common dimension of behavior rating, 
although it did not clearly replicate previous questionnaire findings. 


REFERENCES 


Bargman, R. “Signifikansuntersuchungen der Einfachen Struktur in 
der Faktoren-Analyse.” Mitteilungsblatt fur Mathematische 
Statistik. Sonderdruck Würzburg: Physica-Verlag, 1954. 

Becker, W. C. “A Comparison of the Factor Structure and Other 
Properties of the 16 P. F. and the Guilford-Martin Personality 
Inventories.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 
XXI (1961), 393-404. Р 

Bernyer, б. "Second Order Factors and the Organization of Cogni- 
tive Functions.” British Journal of Statistical Psychology, Xl 
(1958), 19-29. 

Burt, С. "The Factor Analysis of the Wechsler Scale." British Jour- 
nal of Statistical Psychology, XIII (1960), 82-87. 

Burt, С. “Alternative Methods of Factor Analysis and Relations to 
Pearson’s Method of ‘Principal Axes.’ ” British Journal of Psy- 
chology, Statistical Section, П (1949), 98-121. 

Carroll, J. B. “Biquartimin Criterion for Rotation to Oblique Simple 
Structure in Factor Analysis." Science CXXVI (1957), 1114- 


Cattell, R. В. Factor Analysis. New York: Harper and Brothers; 


Cattell, R. B. Personality and Motivation Structure and Measure- 
ment. New York: World Book Company, 1957. 

Cattell, В. В. and Meschieri, L. “The International, Cross-Cultural 
Constancy of Personality Factors.” Advance Publication No. 12, 
University of Illinois, 1960. 

Cattell, R. B. and Muerle, J. L. “The ‘Maxplane’ Program for 


ў 


JOHN HORN 183 


Factor Rotation to Oblique Simple Structure." EDUCATIONAL 

AND PsycHOLOGICAL MEASUREMENT, XX (1960), 569—590. 

Cattell, В. B. and Scheier, I. Н. ‘Extension of Meaning of Objective 
Test Personality Factors: Especially into Anxiety, Neuroticism, 
Questionnaire and Physical Factors.” Journal of General Psy- 
chology, LXI (1959), 287-315. 

Coan, R. W. “A Comparison of Oblique and Orthogonal Factor 
Solutions.” Journal of Experimental Education, XXVII (1959), 
151-166. 

Dickman, K. W. “Factorial Validity of a Rating Instrument.” Un- 
published Ph.D. thesis, University of Illinois, 1960. 

Edwards, A. L. The Social Desirability Variable in Personality As- 
sessment and Research. New York: Dryden Press, 1957. 

Ferguson, G. A. “The Concept of Parsimony in Factor Analysis.”
Psychometrika, XIX (1954), 281-290.

Fruchter, B. and Novak, E. “A Comparative Study of Three Meth-
ods of Rotation.” Psychometrika, XXIII (1958), 211-221.

Guttman, L. “Some Necessary Conditions for Common Factor
Analysis.” Psychometrika, XIX (1954), 149-161.

Holzinger, K. J. and Harman, H. H. “Comparison of Two Factor
Analyses.” Psychometrika, III (1938), 45-60.

Jackson, M. A. “The Factor Analysis of the Wechsler Scale.” British
Journal of Statistical Psychology, XIII (1960), 79-82.

Kaiser, H. F. “The Varimax Criterion for Analytic Rotation in
Factor Analysis.” Psychometrika, XXIII (1958), 187-200.

Kaiser, H. F. “Alpha-Reliability of Factors.” Unpublished manu-
script, Urbana, Illinois, 1960.

Kaiser, H. F. and Dickman, K. W. “Analytic Determination of
Common Factors.” A. P. A. Convention, 1959.

Karson, S. “Second-Order Personality Factors in Positive Mental
Health.” Journal of Clinical Psychology, XVII (1961), 14-19.

Karson, S. and Pool, K. B. “Second-Order Factors in Personality
Measurement.” Journal of Consulting Psychology, XXII (1958),

Osgood, C. E., Suci, G. J., and Tannenbaum, P. H. The Measure-
ment of Meaning. Urbana, Illinois: University of Illinois Press,

Pinzka, C. and Saunders, D. R. “Analytic Rotation to Simple
Structure: II. Extension to an Oblique Solution.” Educational
Testing Service Research Bulletin No. 54-31, 1954.

Scheier, I. H. and Cattell, R. B. “Confirmation of Objective Test
Factors and Assessment of their Relation to Questionnaire Fac-
tors.” Journal of Mental Science, CIV (1958), 608-624.

Schmid, J. and Leiman, J. M. “The Development of Hierarchical
Factor Solutions.” Psychometrika, XXII (1957), 53-61.

Sokal, R. R. “A Comparison of Five Tests for Completeness of
Factor Extraction.” Transactions Kansas Academy of Science,
LXII (1959), 141-152.

Thurstone, L. L. Multiple-Factor Analysis. Chicago: University of
Chicago Press, 1947.




Warburton, F. W. “The Full Factor Analysis.” British Journal of 
Statistical Psychology, VII (1954), 101-106.

Witkin, H. A., Lewis, H. B., Hertzman, M., Machover, K. Meissner, 
P. B. and Wapner, S. Personality Through Perception, New 
York: Harper and Brothers, 1954. 

White, O. “Some Properties of Three Factor Contribution Matrices.” 
Advance Publication, University of Illinois, 1961. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


OPTIMUM GRADE CLASSIFICATION WITH THE
CALIFORNIA ACHIEVEMENT TEST BATTERY¹


HARRY E. ANDERSON, JR.²
American Institute for Research 
AND 
DONALD A. LETON 
University of California, Los Angeles 


PRESENT-DAY achievement tests are usually constructed to measure
several areas of achievement and, consequently, present a battery
of subtests, each of which purports to measure one particular
aspect of achievement. Test buyers are constantly concerned with
problems of selecting a valid and reliable test, with considerations
for the information available from the test, and with cost factors.
Test users, on the other hand, must consider the meaning of test
scores (e.g., Cronbach, 1960, pp. 69-87), the diagnostic and
prognostic value of the test scores, and a reasonable summary of an
individual’s achievement record. The latter problem is often resolved
by means of a profile pattern indicating an individual’s standing on
each subtest (viz., achievement area) in relation to population norms.

Over-all achievement level, however, is not obviously available
from an individual's profile, and so the user is left with no over-all
estimate of a student's achievement. Moreover, several other
problems are encountered in the summary and interpretation of
achievement test results. First, there are differences in absolute
achievement levels indicated by different achievement batteries, and


¹ The authors express appreciation to the Santa Monica City Schools,
particularly to the Director of Research, Dr. Julius Stier, for
permission and assistance in conducting this study.
² Now with the California Test Bureau.




secondly there are circumstances in which characteristics of the lo- 
cal sample are different from those of the population on which the 
norms were constructed. These two problems were brought to focus 
rather pointedly by Stake (1961) in relation to the California 
Achievement Tests (CAT). A test user, however, may not be inter- 
ested in absolute achievement levels so much as in differentiating 
among achievement levels. If the achievement battery is reasonably 
valid and reliable, a counselor, teacher, or administrator may be in- 
terested, for instance, in determining whether a given seventh grade 
student’s over-all achievement is more consistent with other seventh 
graders than with students at any other grade level. Comparisons 
in most cases will be with students in adjacent grades, such as com- 
paring seventh graders with sixth and eighth graders. The present 
study is designed to illustrate techniques for resolving some of the 
foregoing problems. 

The CAT, one of the most widely used achievement batteries, 
offers subtest scores in Reading Vocabulary (RV), Reading Com- 
prehension (RC), Arithmetic Reasoning (AR), Arithmetic Funda- 
mentals (AF), Mechanics of English (ME), and Spelling (S). The 
CAT, then, because of the subtest scores and available profiles, pre- 
sents many of the problems discussed above. The CAT has been used 
to predict school grades (e.g., Anderson, 1961) and also has been 
found useful in methods of absolute prediction such as presented by 
Anderson and Fruchter (1960). The subtests are known to correlate 
significantly at most grade levels, however (Anderson & Slivinske, 
in press), so that having a student’s RC subtest score, for instance, 
in addition to his RV subtest score does not necessarily provide 
additional information with regard to the student’s reading achieve- 
ment. Moreover, although the achievement profiles available in the 
test are of interest for diagnostic and prognostic purposes, the multi- 
ple scores are often confusing in terms of the over-all assessment, and 
possible classification, of individuals. The present paper presents 
discriminant analyses, using a linear compound of the six subtests, 
which optimize classification of students for adjacent grades from 
Grade Four to Grade Twelve. 


Method 


The Sample. The 1957 edition of the CAT (Form W) was ad- 
ministered to the entire student population, grades four to twelve, of 
the Santa Monica City Schools in October, 1960. The total number 




of students at each grade level ranged from approximately 800 to 
1000. Random samples of 150 students were drawn at each grade 
level, and the analyses for the present study were computed on 
the data obtained from these random samples. 

The Subtest Scores. The grade placement scores, rather than the 
raw scores, were used in the analyses of the present study for sev- 
eral reasons: (a) they offer the most direct comparisons of achieve- 
ment levels from one subtest to another; (b) they allow for com- 
parisons between the several levels of the test (viz., Elementary, 
Junior High, and Advanced); and (c) they are usually the most 
available set of scores for teachers, administrators, and counselors. 

The Analyses. Mahalanobis (1936) distance functions, and the 
corresponding discriminant weights for each of the six subtests, were 
computed to maximally differentiate students in adjacent grade 
levels (between grades four and five, between grades five and six, and 
so on) from Grade Four to Grade Twelve.³


Results 


The table of subtest means for each grade level, together with the
eight sums of squares and cross-products matrices, have been de-
posited with the American Documentation Institute.⁴ The results
of the analyses are presented in Table 1.


Table 1 presents the discriminant weights for each of the subtests
together with the D² distance and F values. The D² distance
functions between the i and j groups are defined in this study as

$$D_{ij}^{2} = d_{ij}'\, S^{-1} d_{ij} \tag{1}$$

where $d_{ij}$ is the six-by-one column matrix of mean differences
between the i and j groups, $d_{ij}'$ is its transpose, and $S^{-1}$ is the
inverse of the six-by-six dispersion matrix. The conversion of D² to F,
following Rao (1952), is taken as

$$F = \frac{N_i N_j (N_i + N_j - p - 1)}{p\,(N_i + N_j)(N_i + N_j - 2)}\, D_{ij}^{2} \tag{2}$$

where $N_i$ is the sample size of one group, $N_j$ is the sample size of


³ The authors are indebted to the Western Data Processing Center at UCLA
for their analytic programs and computer facilities.
⁴ Order Document No. 7362 from the ADI Auxiliary Publications Project,
Photoduplication Service, Library of Congress, Washington 25, D. C., remitting
$1.25 in advance for 35 mm. microfilm or $1.25 for photocopies. Make check or
money order payable to Chief, Photoduplication Service, Library of Congress.




TABLE 1

Discriminant Weights for the Six CAT Subtests, D²
Distance Values, and F Values for Each Discriminant Analysis

          Discriminant Weights for the Subtest
Grade                                                        D²         F
Levels     RV      RC      AR      AF      ME      S       Values     Value

 4, 5    -.0382   .0834   .0213   .4715   .0173  -.0397    1.6039    19.7119
 5, 6     .1059  -.1026   .0176   .2665  -.0477   .1006    1.0643    13.0808
 6, 7    -.0905   .0870  -.1217   .2698   .0615  -.0494     .9284    11.4112
 7, 8     .5504  -.1216   .7345   .3339   .5932   .6848     .6685     8.2154
 8, 9     .7091   .1199   .5452  -.5915  -.3262   .1647     .1793     2.2037*
 9, 10    .3089  -.0042   .6589  -.1506   .4822  -.1702     .6507     7.9969
10, 11    .1708  -.0757   .2332   .0449  -.3241   .4078     .1613     1.9819
11, 12    .9016  -.0123  -.3287  -.2370   .0059   .2856     .4283     5.2637**

* Significant at the .05 level of confidence
** Significant at the .01 level of confidence


the other group, and p is the number of variables; the F values can
be tested with p = 6 and $N_i + N_j - p - 1 = 293$ degrees of freedom.
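The conversion in (2) is easy to check by hand or by machine. The following is a minimal sketch in Python, not part of the original analysis; the function name `d2_to_f` is hypothetical, and the inputs are the grade four and five D² value from Table 1 with the 150 students per grade described in the Method section.

```python
def d2_to_f(d2, n_i, n_j, p):
    """Convert a two-group D^2 value to an F ratio, following equation (2).

    Returns (F, df1, df2) with df1 = p and df2 = n_i + n_j - p - 1.
    """
    total = n_i + n_j
    df2 = total - p - 1
    f_value = (n_i * n_j * df2) / (p * total * (total - 2)) * d2
    return f_value, p, df2


# Grades four and five in Table 1: D^2 = 1.6039, 150 students in each grade.
f_value, df1, df2 = d2_to_f(1.6039, 150, 150, 6)
print(round(f_value, 2), df1, df2)  # prints 19.71 6 293
```

The result agrees with the tabled F of 19.7119 to within rounding of the published D² value.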


An examination of the D² values in Table 1 indicates that better
discrimination is obtained in the middle grades (viz., grades four to
eight) than in the upper grades. Indeed, the discrimination between
the eighth and ninth grades does not reach the one per cent level of
confidence, while that between the tenth and eleventh grades does
not even reach the five per cent level of confidence. The variability
of grade placement scores on the subtests for the eighth and ninth
and tenth and eleventh graders is such that discrimination is not as
efficient here as elsewhere among the grades.

The AF variable is the better of the two arithmetic variables (i.e.,
has the greater weight) for discrimination purposes from grades four
to six, and AR assumes a dominant role thereafter. The RV variable
seems to be the most important reading discriminator with the
exception of discrimination between the fourth and fifth grades. The
English (ME) variable appears prominent in discrimination among
grades seven to eleven, while S is relatively important between grades
four and five, seven and eight, ten and eleven, and eleven and twelve.
One of the arithmetic variables has the greatest discriminant weight
in the five analyses with the greatest D² values, and the second
highest discriminant weight in the sixth and seventh ranked D² values.

The discriminant function equations for the classification of stu-
dents are presented in Table 2.




The discriminant functions are obtained for classifying students
into the ith or jth grade level, following Anderson and Fruchter (1957), as

$$Z_{ij} = X' S^{-1} d_{ij} - \tfrac{1}{2}\left(\bar{X}_j' S^{-1} \bar{X}_j - \bar{X}_i' S^{-1} \bar{X}_i\right) \tag{3}$$

where X' is the one-by-six row vector of a given student's grade placement
scores; $d_{ij}$ and $S^{-1}$ are as defined in (1), with $d_{ij}$ taken as the
jth-grade means minus the ith-grade means; $\bar{X}_i$ and $\bar{X}_j$ are six-by-one
column vectors of means for the i and j grade levels, respectively,
and $\bar{X}_i'$ and $\bar{X}_j'$ are the respective transpose matrices. This
functional form, assuming equal groups and equal loss functions, is
similar to the Bayesian-couched functions presented by Anderson (1951)
and Rao and Slater (1949). Assuming i < j and using (3), a given
student is classified with students in the ith grade if his $Z_{ij}$ score is
negative, and with students in the jth grade if his $Z_{ij}$ score is positive.
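The classification rule can be illustrated with a small numerical sketch. The Python below is only an illustration under stated assumptions: it uses two hypothetical subtests rather than six so that the dispersion matrix can be inverted in closed form, the means and dispersion values are invented, the helper names are hypothetical, and the mean difference is taken as upper-grade means minus lower-grade means, which is the orientation the stated rule (negative score, lower grade) implies.

```python
def inverse_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("dispersion matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def quad(u, m, v):
    """Return u' M v for vectors u, v and a square matrix M."""
    size = len(u)
    return sum(u[r] * m[r][c] * v[c] for r in range(size) for c in range(size))


def discriminant_score(x, mean_i, mean_j, s_inv):
    """Z = x' S^-1 d - 1/2 (mean_j' S^-1 mean_j - mean_i' S^-1 mean_i), d = mean_j - mean_i."""
    d = [mj - mi for mi, mj in zip(mean_i, mean_j)]
    constant = 0.5 * (quad(mean_j, s_inv, mean_j) - quad(mean_i, s_inv, mean_i))
    return quad(x, s_inv, d) - constant


# Hypothetical two-subtest illustration (not data from the study).
mean_lower = [4.0, 4.2]
mean_upper = [5.0, 5.1]
s_inv = inverse_2x2([[1.0, 0.5], [0.5, 1.2]])

for scores in ([4.1, 4.0], [5.2, 5.3]):
    z = discriminant_score(scores, mean_lower, mean_upper, s_inv)
    print(scores, "lower grade" if z < 0 else "upper grade")
```

With a symmetric dispersion matrix the score is zero at the midpoint of the two mean vectors, which is why the sign of Z alone decides the classification when the groups and losses are treated as equal.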

Discussion 

An application of the equations can best be illustrated with an
example. Consider, for instance, a fifth grade student who enters a
counselor’s office and requests appraisal of his grade placement
scores, which are as follows: RV = 5.2, RC = 4.9, AR = 4.8, AF
= 4.6, ME = 5.2, and S = 5.3. He knows that half of his grade
placement scores are below the 5.0 level, and also, perhaps, that it 
might repay him to spend more study time in these areas. He wishes 
to know further if, in general, he is doing as well as other fifth grade 
students. Since the student appears to be on the borderline between 
the fourth and fifth grade achievement levels (i.e., 5.0 grade place- 
ment), the counselor uses the first equation in Table 2 and computes 


$$Z_{4,5} = -.0382(5.2) + .0834(4.9) + .0213(4.8) + .4715(4.6) + .0173(5.2) - .0397(5.3) - 2.4719 = -.1112$$
TABLE 2
Discriminant Equations for Separating Adjacent Pairs of Grades

Grades   Equation
 4, 5    Z_{4,5}   = -.0382 RV + .0834 RC + .0213 AR + .4715 AF + .0173 ME - .0397 S -  2.4719
 5, 6    Z_{5,6}   =  .1059 RV - .1026 RC + .0176 AR + .2665 AF - .0477 ME + .1006 S -  1.9886
 6, 7    Z_{6,7}   = -.0905 RV + .0870 RC - .1217 AR + .2698 AF + .0615 ME - .0494 S -  1.0190
 7, 8    Z_{7,8}   =  .5504 RV - .1216 RC + .7345 AR + .3339 AF + .5932 ME + .6848 S - 11.9964
 8, 9    Z_{8,9}   =  .7091 RV + .1199 RC + .5452 AR - .5915 AF - .3262 ME + .1647 S -  5.6819
 9, 10   Z_{9,10}  =  .3089 RV - .0042 RC + .6589 AR - .1506 AF + .4822 ME - .1702 S - 11.0801
10, 11   Z_{10,11} =  .1708 RV - .0757 RC + .2332 AR + .0449 AF - .3241 ME + .4078 S -  4.6107
11, 12   Z_{11,12} =  .9016 RV - .0123 RC - .3287 AR - .2370 AF + .0059 ME + .2856 S -  7.1200




The counselor notes that the student’s discriminant score is negative 
and thus must inform the student that his over-all achievement level 
more closely resembles that of fourth grade students than fifth grade 
students. 
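The counselor's computation can be reproduced in a few lines. This is a minimal sketch in Python using the weights and constant of the grade four and five equation in Table 2 and the student's six scores given above; the names `WEIGHTS_4_5`, `CONSTANT_4_5`, and `z_score` are hypothetical.

```python
# Weights and constant of the grade four/five equation, as read from Table 2.
WEIGHTS_4_5 = {"RV": -0.0382, "RC": 0.0834, "AR": 0.0213,
               "AF": 0.4715, "ME": 0.0173, "S": -0.0397}
CONSTANT_4_5 = 2.4719


def z_score(scores, weights, constant):
    """Weighted sum of grade placement scores minus the equation's constant."""
    return sum(weights[name] * scores[name] for name in weights) - constant


student = {"RV": 5.2, "RC": 4.9, "AR": 4.8, "AF": 4.6, "ME": 5.2, "S": 5.3}
z = z_score(student, WEIGHTS_4_5, CONSTANT_4_5)
mean_placement = sum(student.values()) / len(student)

print(round(z, 4))               # -0.1112
print(round(mean_placement, 1))  # 5.0
print("grade four" if z < 0 else "grade five")
```

The unweighted mean of the six scores sits exactly on the 5.0 borderline, whereas the weighted compound gives a negative score, largely because AF, the lowest of the student's scores, carries by far the largest weight in this equation.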

The equations presented in this article are optimizing equations 
in the sense that they build a new variable from a linear combina- 
tion of the six subtests. The discriminant analysis maximizes profile 
differences between adjacent grade levels by determining unique co- 
efficients (i.e., weights) for each of the subtest scores, with methods 
such as found in Kendall (1957), Rao (1952), Rao and Slater (1949) 
and Tiedeman (1954). Several other procedures have traditionally 
been used in the summary of achievement profiles. One such method 
involves taking an average over the subtests (i.e., applying a unit
coefficient to each score) such as noted initially in the example above. 
This method, as well as any other arbitrary weighting method, lacks 
efficiency. Lubin (1957) discusses analysis of a similar problem with 
beta weights in multiple correlation. Another method might involve 
using only one or several of the subtest scores with the judgment 
that these subtests in the profile should receive most, or exclusive, 
weight for classification. A similar systematic procedure could be 
developed by testing the significance of the discriminant weights in 
discriminant analysis, but the arbitrary exclusion of variables is not 
recommended. In a word, the subtest combinations afforded by the 
equations in Table 2 offer better discrimination than any other set 
of subtest score combinations. The approach is quite generalizable, 
also, for comparing individuals with one another as well as classify- 
ing individuals into groups. Other approaches, such as used by 
Sawrey, Keller, and Conger (1960), following Cronbach and Gleser 
(1953), neglect joint distribution probability characteristics (cf.
Tiedeman, 1954) and are not recommended by the present authors. 

The obtained coefficients in Tables 1 and 2 are of the least-squares
type and as such are subject to change with different subtest
variance-covariance matrices. The application of the obtained equations
in other school systems is therefore not recommended without cross-
validation. For populations in which the test distributions yield
markedly different variance-covariance matrices, the application of
these discriminant coefficients would not provide maximum efficiency.
The methodology rather than the obtained results herein, therefore,
may be of more interest to CAT users and researchers.




Other applications of the discriminant function are likely to pro- 
duce some subtest coefficients with a negative sign such as appear 
in the equations herein. Some objections occasionally arise from test 
users because of the opinion that each subtest should contribute posi- 
tively to discrimination between groups. This position neglects con- 
sideration of subtest covariances and the optimum orientation of the 
discriminant plane (cf., Hoel, 1947, pp. 121-126). A system for dis- 
crimination could be constructed that would require all of the co- 
efficients to be positive, similar to Lev’s (1956) method of obtaining 
totally positive beta weights in multiple correlation, but the result- 
ing discrimination would very likely not be as good as the more gen- 
eral, optimum method proposed here. 

The results of the present study differ to some extent from the re- 
sults of a previous study conducted by Anderson (1960). Anderson 
used four of the CAT subtests (RV, RC, AR, and AF) to discrimi- 
nate between fourth, fifth, and sixth graders and found that the read- 
ing subtests received consistently greater emphasis (i.e., weights) in 
discrimination among the grade levels; in the present study, AF is 
the most important discriminating variable at these grade levels. 
Anderson’s samples, however, were taken from the Los Angeles school 
system and so comparisons with these results might reflect differences 
in the populations under study rather than inconsistencies in results. 
In particular, the Los Angeles students might be more homogeneous 
with respect to the reading variables while the Santa Monica stu- 
dents might be more homogeneous with respect to the arithmetic 
variables. Discriminant equations, in fact, could be used with widely
standardized tests, such as the CAT, to compare achievement, and 
also to compare the influence of different curriculum emphases be- 
tween grades and between school systems in just this manner. 

Discriminant analysis of achievement data might well provide
insights into more general achievement problems. Shaw and Mc-
Cuen (1960), for instance, studied under- and over-achievement
from grades three to twelve. Discriminant analyses in a given school
system could provide information about the important variables for
identifying groups of students, within grade levels as well as between
grade levels, who display particular deviations in various areas of
achievement. Thus, information for alleviating general curriculum
problems, such as planning for remedial and enrichment classes, may
become available through discriminant analysis.




It was noted in the Results section that better discrimination was 
obtained with the CAT subtests in the middle grades (grades four 
to eight) than in the upper grades. Further studies in other school 
systems should indicate whether this condition is peculiar to the 
Santa Monica City Schools or whether it is general throughout all 
school systems. If this condition is shown to persist, the scoring 
procedures for upper grade levels CAT subtests should probably be 
revised to be sensitive to achievement differences extant in these up- 
per grades. 


Summary 


Random samples of 150 students were selected from each grade, 
four through twelve, in the October, 1960, Santa Monica City Schools 
testing program. All students had taken Form W of the California 
Achievement Test, and the grade placement scores for all six sub- 
tests were selected for use in the analyses of the study because: (a) 
they offer the most direct comparisons of achievement levels from 
one subtest to another; (b) they allow for comparisons between the 
several levels of the test (viz., Elementary, Junior High, and Ad- 
vanced); and (c) they are usually the most available set of scores 
for teachers, administrators, and counselors. A discriminant analysis 
was performed, using the six subtests, between each pair of adjacent 
grades (e.g., grades four and five, grades five and six, and so on), 
and equations were presented that optimize classification of stu- 
dents as a function of a linear compound of the six subtest scores. 
The results revealed that better discrimination was obtained through- 
out grades four to eight than in later grades. Further, the results 
and the methodology were indicated to be of value in achievement 
studies across school systems as well as within school systems, in 
studies of a more general nature, such as curriculum planning, and 


for use in possible subsequent revisions of the California Achieve- 
ment Test. 


REFERENCES 


Anderson, H. E., Jr. “Some Test Results and a Method of Teaching
Testing Courses.” Paper read at the annual meeting of the West-
ern Psychological Association, San Jose, California, 1960.

Anderson, H. E., Jr. “A Study of Language and Nonlanguage
Achievement.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT,
XXI (1961), 1037-1038.

Anderson, H. E., Jr. and Fruchter, B. “Statistical Procedures: The
Discriminant Function for the Two-Group Case.” Research
Guide Number 2. Psychometric Laboratory, Department of Edu-
cational Psychology, University of Texas, 1957.

Anderson, H. E., Jr. and Fruchter, B. “Some Multiple Correlation
and Predictor Selection Methods.” Psychometrika, XXV (1960),
59-76.

Anderson, H. E., Jr. and Slivinske, A. J. “A Study of Intelligence and
Achievement at the Fourth, Fifth, and Sixth Grade Levels.” Jour-
nal of Experimental Education, in press.

Anderson, T. W. “Classification by Multivariate Analysis.” Psycho-
metrika, XVI (1951), 31-50.

Cronbach, L. J. Essentials of Psychological Testing. New York:
Harper and Brothers, 1960.

Cronbach, L. J. and Gleser, G. C. “Assessing Similarity Between
Profiles.” Psychological Bulletin, L (1953), 456-478.

Hoel, P. G. Introduction to Mathematical Statistics. New York:
John Wiley & Sons, 1947.

Kendall, M. G. A Course in Multivariate Analysis. London: Griffin
and Company, 1957.

Lev, J. “Maximizing Test Battery Prediction When the Weights are
Required to be Non-Negative.” Psychometrika, XXI (1956),
245-252.

Lubin, A. “Some Formulae for use with Suppressor Variables.” EDU-
CATIONAL AND PSYCHOLOGICAL MEASUREMENT, XVII (1957), 286-

Mahalanobis, P. C. “On the Generalized Distance in Statistics.” Pro-
ceedings of The National Institute of Science, India, XII (1936),

Rao, C. R. Advanced Statistical Methods in Biometric Research.
New York: John Wiley & Sons, 1952.

Rao, C. R. and Slater, P. “Multivariate Analysis Applied to Differ-
ences Between Neurotic Groups.” British Journal of Psychology:
Statistical Section, II (1949), 17-29.

Sawrey, W. L., Keller, L. and Conger, J. J. “An Objective Method of
Grouping Profiles by Distance Functions and its Relation to
Factor Analysis.” EDUCATIONAL AND PSYCHOLOGICAL MEASURE-
MENT, XX (1960), 6! 3.

Shaw, M. C. and McCuen, J. T. “The Onset of Academic Under-
Achievement in Bright Children.” Journal of Educational Psy-
chology, LI (1960), 103-108.

Stake, R. E. “‘Overestimation’ of Achievement with the California
Achievement Test.” EDUCATIONAL AND PSYCHOLOGICAL MEASURE-
MENT, XXI (1961), 59-62.

Tiedeman, D. V. “A Model for the Profile Problem.” Proceedings,
1958 Invitational Conference on Testing Problems, Educational
Testing Service, 54-75.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


RESPONSE SET ON INTEREST INVENTORY TRIADS¹


DAVID P. CAMPBELL AND WAYNE W. SORENSON
University of Minnesota 


SINCE Cronbach called attention to response set (Cronbach, 1946;
Cronbach, 1950) there has been an awareness among test makers
that the decision about item format can be one of the most important
decisions made in the construction of any new instrument. For, as
Cronbach showed, an individual's response can be influenced not
only by the content of the item, but also by the form of presentation.
This influence he called response set and defined it as “...any
tendency causing a person consistently to give different responses to
test items than he would when the same content is presented in a
different form” (Cronbach, 1946, p. 476). An example might be
True-False tests where there is a tendency for subjects to pick more
True answers than False, particularly when they are unsure of them-
selves. Cronbach went on to list many more examples of response
sets, their effects (normally a decrease in validity), and ways to
safeguard against them.

He also suggested the possibility that various categories of people
might have different response sets. If instances of this could be
found, then response sets could be utilized in psychological instru-
ments to differentiate between those categories. The study reported
here was an attempt to discover any response set due to location of
statements within interest inventory triads, and, if such were found,
to determine if they were effective in distinguishing between various
occupational groups.

The triad as it is used in the Kuder Preference Record and Clark’s 


¹ Computer time for the data analysis was furnished by the Numerical
Center at the University of Minnesota.




Minnesota Vocational Interest Inventory (MVII) is a type of
forced-choice item. Here the individual is faced with a triad of
statements and asked to select one he would like most and one he
would dislike most. An example is:

                      Like    Dislike
A. Fix a doorbell     ( )      ( )
B. Make coffee        ( )      ( )
C. Sort mail          ( )      ( )

If he likes A and dislikes C, he has effectively ranked the statements ABC.
The other possible rankings are ACB, BCA, BAC, CAB, and CBA.
This restriction in the possible ways the statements can be reacted 
to should decrease considerably any response set, and the individ- 


there truly are differences in the way that occupational groups react
to item content, these differences should appear, unobscured by any
“test-taking bias,” and the inventory should prove to be a valid
instrument.

The Minnesota Vocational Interest Inventory (MVII), with
which this study deals, has been established as a valid, reliable in-
strument, useful for differentiating between occupational interests
in the skilled trades (Clark, 1961). If any response set is present, it
does not seem to be exerting any gross harmful effect, although
there is no assurance that some mild response set is not affecting
validity. Or, another possibility that Clark has mentioned (Clark,
1961, p. 16) is that any response set present might actually be
operating to help separate occupational groups. The MVII keys were
developed empirically, using actual item response differences be-
tween occupations. If response sets are common to all members of
an occupation, this could help to differentiate this occupation from
others. If this is the case, we should certainly be aware of it and
spend some time trying to develop stronger response sets for this
very purpose.

On the MVII, a response set would most likely show itself by an
individual's choosing various patterns of response regardless of
item content. Thus a person might tend to pick the first statement
as liked best and the last statement as liked least. If the individual
makes his selection strictly on item content, ignoring the arrange-
ment, and if the arrangement of statements within the triad is ran-
dom², we would expect each pattern to occur roughly one-sixth of
the time. If this is not true, a response set would apparently be
present.

The null hypothesis under test is that the individual’s frequency 
of selection of the various patterns is not significantly different from 
chance. To test this hypothesis, it is necessary to tabulate, for each 
person, the frequency of each pattern over all the triads. Using the 
chi-square statistic, this can be compared to the expected proportion 
of one-sixth. 
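The individual test described above amounts to a chi-square goodness-of-fit comparison with five degrees of freedom. The following Python sketch illustrates it; the function name, the count of 120 triads, and the pattern counts are hypothetical and are not taken from the MVII data.

```python
def pattern_chi_square(counts):
    """Chi-square statistic for observed pattern counts against equal expected counts."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


# Critical value of chi-square with 5 degrees of freedom at the .05 level.
CRITICAL_05_DF5 = 11.070

# Hypothetical counts of the six rank orders for one person over 120 triads.
counts = [20, 25, 15, 30, 12, 18]
statistic = pattern_chi_square(counts)
print(round(statistic, 2), statistic > CRITICAL_05_DF5)  # prints 10.9 False
```

In this made-up case one pattern is chosen 25 per cent of the time, yet the statistic falls just short of the five per cent critical value, which shows how large a departure from one-sixth is needed before an individual's record is flagged.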

To test the presence of a response set unique to a given occupa- 
tion, and useful for differentiating this occupation from others, it is 
necessary to select a sample of men from that occupation and tabu-
late a pattern frequency over all individuals on all triads. These
frequencies are then compared to the same frequencies collected for
a reference group of men-in-general. The null hypothesis is that the
two groups do not differ in frequency of pattern selection. 
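The group comparison can be sketched as a chi-square test on a two-by-six table of pattern counts. The Python below is an illustration only: the function name is hypothetical, the counts are invented, and the text does not specify which statistic was actually applied to the group frequencies. Every pattern is assumed to have been chosen at least once, so that no expected count is zero.

```python
def homogeneity_chi_square(group_a, group_b):
    """Chi-square statistic for a 2 x k table of pattern counts from two groups."""
    total_a, total_b = sum(group_a), sum(group_b)
    grand_total = total_a + total_b
    statistic = 0.0
    for count_a, count_b in zip(group_a, group_b):
        column_total = count_a + count_b
        for observed, row_total in ((count_a, total_a), (count_b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical pattern counts (BAC, ABC, BCA, CBA, ACB, CAB); not data from the study.
occupation = [140, 170, 140, 200, 160, 190]
reference = [150, 165, 145, 195, 165, 180]

statistic = homogeneity_chi_square(occupation, reference)
degrees_of_freedom = (2 - 1) * (len(occupation) - 1)
print(round(statistic, 3), degrees_of_freedom)
```

With five degrees of freedom the statistic would have to exceed 11.070 to be significant at the five per cent level; two groups with nearly identical pattern percentages, as in this example, fall far below that value.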

If differences appear, they can be used to separate the occupation 
from the men-in-general group, and should add to any validity de- 
rived from the item content. 


Method 


The raw data used in this study were collected by K. E. Clark
in his standardization of the MVII. Several occupational groups and
the tradesmen-in-general (TIG) group were used.

Frequencies of the possible rank ordering for each triad were ac-
cumulated, first for each person across all triads, and then for each
occupational group across all individuals.

A sample of printers was used to test the presence of an individual
response set. The actual frequency of response to each of the six
patterns for each individual was compared to the theoretical pro-
portion of one-sixth using the chi-square statistic.

Figure 1 shows a distribution of these chi-squares. While this dis- 
tribution shows a mild resemblance to the theoretical chi-square 
curve, there are considerably more extreme values than could be 
expected by chance, i.e., twenty-five per cent of the scores exceed 

² Of the actual arrangement, Clark says, “They were grouped in threes in a
fairly haphazard fashion . . .” (Clark, 1961, p. 17).


FIGURE 1. Distribution of chi-squares for 134 individuals, shown with the
theoretical χ² curve (horizontal axis: chi-square value; vertical axis:
frequency). Σχ² = 1096.4 (with 670 df, significant at the .01 level).


the five per cent significance level, and the summated chi-square
also is significantly higher than would be expected by chance. It
appears that there is some non-random influence. What the practical
influence of this non-randomness might be is difficult to assess, but
some information can be gleaned by inspecting the most deviant
individuals. In the most extreme case, an individual selected one
pattern 30 per cent of the time as contrasted with the chance
expectation of 16.67 per cent. While this is sufficient to give a
statistically significant chi-square, it does not seem to be enough to be
concerned about in the practical application of the inventory,
particularly since this is the most extreme example in a sample of 134
persons. While it might slightly decrease the validity of the
inventory, it is unlikely that the deviation is sufficient to warrant much
concern over its use in a predictive sense, or any attempt to
calibrate the individual's location bias before scoring his answer sheet.
It seems safe to conclude that the MVII is, for all practical pur-
poses, free of any individual pattern response set.

In an attempt to learn more about this slight non-randomness 




among the individual's responses, and also to check for the presence 
of any location bias common to one occupation, frequencies of re-
sponse patterns were tabulated for several occupational groups over
all triads. Table 1 gives this data. 

Although there are some deviations from a perfectly flat dis- 
tribution, there is no apparent response set peculiar to any one 
group. While there are distinct and consistent differences in the 
frequencies, the differences are very stable across the various groups, 
and they are of small practical significance. The differences in the 
frequency of selection seem to be caused by popularity of the C 
choice and the unpopularity of the B choice. From this data it is 
impossible to say whether the differences in frequencies are caused 
by a response set across all groups or whether they are caused by 
the location of popular items within the triads. If, in the initial lay- 
out of the inventory, there was a slight tendency to place popular 
statements in the C position and unpopular ones in the B position, 
then differences in response frequencies would be expected. On the 
other hand, if the popular statements were spread evenly throughout 
the positions, the data would indicate a slight position bias in favor 
of statement C.

In an attempt to determine which of these was the most reason- 
able explanation, two other groups were compared. Both of these 
groups were drawn from the patient population of the Vocational 
Counseling Service of the Veterans Administration Hospital in 


TABLE 1
Per Cent Response to the Possible Rank Orders
by Various Occupational Groups

Occupational                 Possible Rank Orders
Groups                 BAC  ABC  BCA  CBA  ACB  CAB    N
Tradesmen-in-general 14 17 14 20 16 19 240 
Electricians B n 739 34 ae) 19» 009
Printers 14-137 4. 160049197 A
Plumbers 144 ^j. 3, 0. 15 SIR EE
Painters 14 av Va dei АЕ г.
IBM Operators ТОЕТ,
Milk Wagon Drivers 15 "d$" ke —19 — 17.0 12 NR
Retail Sales 14 738; 38^. ТЕ АЙ e
Machinists B ar ge 30 09 30005
inen is 14 216. 1855. 20. 17 ИЗИ
Sheet Metal Workers 13 17 a 20 16. 1951908
VA Control Group 15 17 93 o 1970427 EON




Minneapolis. One group of 95 people, used as a control group, 
answered the MVII in its usual form, while an experimental group 
answered a revised form which had the statements randomly rear- 
ranged within the triads. The individual triad still contained the 
same statements, but they were in a different order. 

If the differences in pattern selection were due to a location bias,
this rearrangement of items should have little effect on the fre- 
quency of pattern selection. But if they were caused by a dispro- 
portionate number of popular items in the C position, then the 
frequencies of the various patterns should be changed somewhat. As 
the data in Table 1 indicate, the latter explanation is probably the 
correct one. The control group that took the standard MVII selected
the patterns in about the same frequency as all of the other groups, 
but the experimental group that took the revised MVII selected the 
patterns in mildly different frequencies. 

Because these percentages include multiple responses from each 
individual and thus are not made up of independent observations, 
no satisfactory test of significance was found. Simple inspection of 
the data indicates that, though the changes in frequencies are small, 
they are the only deviations from the frequencies of the other groups. 
For example, this experimental group is the only group selecting 
any pattern ranking B first (BAC) more frequently than either of 
the C first (CBA, CAB) patterns. 

With this data in hand, it does not seem grossly imprudent to 
decide, without a statistical test, that this last group did respond 
differently to the rearranged inventory, and thereby to conclude that 
it is item content and not location which causes the differences in 
pattern selection. 

Another way to analyse this comparison between the control and 
experimental groups that is amenable to statistical test is by in- 
dividual triad. The percentage response to each of the six possible 
orders of the original triad can be compared with the percentages on 
the rearranged triad. If the rearrangement has no effect on the in- 
dividual’s ordering of the triad of statements, these percentages 
should be the same, or differ only by chance. These comparisons 
were made for each triad, using a contingency chi-square. Figure 2 
shows the distribution of these chi-squares. 
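
The per-triad comparison described above can be sketched as a 2 x 6 contingency chi-square. This is a minimal illustration only: the counts are invented, the experimental group size of 95 is an assumption (only the control group size is given in the text), and the function name is hypothetical.

```python
def contingency_chi_square(table):
    """Pearson chi-square and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if form (row) and rank order (column) are independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi_square += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, df

# Invented counts for one triad; columns are the six rank orders of the same statements.
original_form = [14, 16, 15, 18, 15, 17]    # control group, N = 95
rearranged_form = [16, 15, 14, 17, 16, 17]  # experimental group, hypothetical N = 95
triad_chi_square, triad_df = contingency_chi_square([original_form, rearranged_form])

print(round(triad_chi_square, 3), triad_df)
```

Repeating the calculation for every triad yields one chi-square per triad, each with 5 degrees of freedom, which is the kind of distribution plotted in Figure 2.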


3 We would like to acknowledge the cooperation of Dr. Arthur Bradley in
collecting this data. 




FIGURE 2
DISTRIBUTION OF CHI-SQUARES
FOR 190 TRIADS
[Histogram: frequency (vertical axis, 0 to 25) against chi-square value (horizontal axis, 0 to 30), with the theoretical χ² curve superimposed.]


This distribution closely resembles the theoretical chi-square dis- 
tribution, the number of extreme values is about what would be 
expected by chance, and the summated chi-square is non-significant. 
(Note that the summated chi-squares are correlated.) Changing 
the location of the statements within the triads has apparently no 
effect on the rank ordering of the statements. The conclusion seems 
even clearer that triads of interest statements used in this manner 
can be considered free of any location bias. 

Discussion 

These results suggest strongly that the MVII, using the triad item
format, is essentially free of response set concerned with item loca- 
tion (within the triad). As item location is the most obvious source 
of response set on this inventory, it seems likely that the MVII can 
be considered free of any serious influence of this type. 

Because no response set of this type was found, it necessarily 
follows that this response set is of no help in differentiating between 
occupational groups.

The finding that there were consistent differences between the 




various patterns of response, due mainly to the location of the 
popular and unpopular items, would suggest that the cautious re- 
searcher should be careful to assign items randomly to their loca- 
tions in the triad. While there is no apparent response set operating 
on these triads, it is conceivable that one could be drummed into 
individuals by always placing the most popular item in the same 
location within the triad. Random arrangement, balanced according 
to item popularity, should be an adequate precaution against any 
systematic influence of this nature. 


Summary 


This study compared the frequencies of various patterns of re- 
sponse to the triads of the Minnesota Vocational Interest Inventory, 
both for individuals and for occupational groups at the skilled trades
level. 

The results indicate that this inventory is essentially free of any 
response bias caused by item location within the triad. While some 
individuals choose response patterns in frequencies differing from 
chance, none of the deviations are, for practical purposes, very large. 
Since frequencies of response to various patterns were constant across 
the occupational groups, no response set was found that would be 
useful in differentiating between occupations. 


REFERENCES 


Clark, K. E. Vocational Interests of Nonprofessional Men. Minneapolis:
University of Minnesota Press, 1961.

Cronbach, L. J. “Response Sets and Test Validity.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, VI (1946), 475-494.

Cronbach, L. J. “Further Evidence on Response Sets and Test Design.”
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, X
(1950), 3-31.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
Vol. XXIII, No. 1, 1963


INCREMENTAL VALIDITY: A RECOMMENDATION¹


LEE SECHREST 
Northwestern University 


The 1954 APA publication Technical Recommendations for Psychological
Tests and Diagnostic Techniques established minimum
standards to be met in the production and promotion of psychometric
instruments. Since that time there have appeared a considerable
number of articles elaborating or extending the considerations
involved in developing tests (e.g., Cronbach & Meehl, 1955; Jessor
& Hammond, 1957; Loevinger, 1957; Campbell & Fiske, 1959; Bechtoldt,
1959; Campbell, 1960). In one of the most recent developments,
Campbell and Fiske (1959) have suggested that a crucial
distinction is to be made between convergent and discriminant
validity. It is necessary to demonstrate not only that a measure
covaries with certain other connotatively similar variables, but that
its covariance with other connotatively dissimilar variables is
limited.

Campbell (1960) has suggested several possible additions to recommended
validity indicators, all of which focus on the problem of
discriminant validity, i.e., the demonstration that a test construct
is not completely or even largely redundant with other better established
or more parsimonious constructs. He has suggested, for
example, that correlations with intelligence, social desirability and
self-ratings should be reported since these variables are likely to be
conceptually and theoretically simpler than most of our constructs.
If a new test proves to be reducible to an intelligence or social desirability
measure, its raison d’être probably vanishes.


1 The writer wishes to thank Donald T. Campbell and Douglas N. Jackson
for helpful suggestions on an earlier version of this manuscript.




It is the purpose of this note to suggest an additional validity 
construct and evidence which should be presented in the basic pub- 
lications concerning any test which is intended for applied, predic- 
tive use. 


Incremental Validity 


Almost without exception evidence which is presented to support 
the validity of a psychological test is presented in the form of some 
improvement over results which would be expected by chance. How- 
ever, in clinical situations, at least, tests are rarely, if ever, used in 
a manner consistent with the chance model. Almost always Ror- 
schachs are interpreted after interviews, reading of case reports, con- 
ferences and the like. The meaning of a report that some Rorschach 
variable will predict better than chance becomes obscure under 
those circumstances. It seems clear that validity must be claimed 
for a test in terms of some increment in predictive efficiency over the 
information otherwise easily and cheaply available. 

Cronbach and Gleser (1957, pp. 30-32) and, as they point out, 
Conrad (1950), have both discussed the problem of the base against 
which the predictive power of a test is to be evaluated. Cronbach 
and Gleser declare, "Tests should be judged on the basis of their 
contribution over and above the best strategy available, making use 
of prior information” (1957, p. 31). They do indicate that tests may 
be valuable in spite of low correlations if they tap characteristics 
either unobservable or difficult to observe by other means. Shaffer
(1950, p. 76) also suggested, “One can . . . study the degree to which
the clinician is valid with and without the aid of a certain tech- 
nique, and thereby assess the value of the test indirectly." We are 
not so sure that such an assessment is completely indirect. 

In light of the above argument it is proposed that the publications 
adduced as evidence for the utility of a test in a clinical situation— 
and probably for most other uses—should include evidence that the 
test will add to or increase the validity of predictions made on the 
basis of data which are usually available. At a minimum it would
seem that a test should have demonstrated incremental validity beyond
that of brief case histories, simple biographical information
and brief interviews. A strong case can also be made to demand that
a test contribute beyond the level of simpler, e.g., paper and pencil,
tests. As a matter of fact, Campbell’s recommendation that new
tests be correlated with self-ratings is quite akin to some aspects of 
incremental validity. 


Adequate Statistical Evidence 


When a test is added to a battery, the usual way to express its 
contribution is either by a partial correlation or by an increment to 
a zero order or multiple correlation. There is, perhaps, one objection 
to the partial or multiple correlation as a demonstration of incre- 
mental validity. That is, the increase, even if significant, is of some- 
what undetermined origin and obscures the exact nature of the in- 
crement achieved. 

Consider the matrix of correlations:

         1     2     0
   1          .60   .40
   2                .40
   0

in which 1 and 2 are predictors of criterion 0. The multiple $R_{0.12}$ =
.45 and the partial $r_{02.1}$ = .22. Both values might be considered to
represent improvements over the zero order correlations. And yet,
without knowing the reliabilities of 1 and 2 we will be unable to dis- 
cern whether 2 contributes to the prediction of 0 because it repre- 
sents a theoretical variable distinct from 1 or whether 2 has only the 
same, and informationally redundant, effect of increasing the length
and, hence, the reliability of Test 1. It will often be important to
know whether an increment results from a Spearman-Brown prophecy
operation or from some contribution of theoretical importance.
Kelley (1927) suggested quite a number of years ago that when
correlations between intelligence and achievement measures are
properly treated the two measures prove to be almost completely
overlapping. Thus, in his view, the two kinds of measures only combine
to form a longer and more reliable measure of a single variable.

One solution to the problem might be the correction of inter-test
correlations for attenuation. If the reliabilities are so low that the
corrected correlation approaches unity, no increment to $R$ nor a
significant partial correlation will ensue. In the above example,
with reliabilities for 1 and 2 of only .60, the correlation between
them would become unity, the multiple would be .40, and the partial
$r$ .00. On the other hand, if both variables had reliability coefficients
of .90, the correction for attenuation would have little effect on
either $R$ or partial correlation.²
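
The arithmetic of the example can be checked with a short sketch. It uses the standard two-predictor formulas for the multiple and partial correlation and the usual correction for attenuation applied to the inter-test correlation only. The function names are hypothetical, and the special handling of a corrected correlation of unity is an assumption added to avoid division by zero; it gives the limiting values only when both predictors correlate equally with the criterion, as they do here.

```python
import math

def disattenuate(r12, rel1, rel2):
    """Correct the inter-test correlation for unreliability of both tests."""
    return r12 / math.sqrt(rel1 * rel2)

def multiple_r(r10, r20, r12):
    """Multiple correlation of criterion 0 on predictors 1 and 2."""
    if r12 >= 1 - 1e-9:  # predictors fully redundant: limiting value when r10 == r20
        return abs(r10)
    return math.sqrt((r10**2 + r20**2 - 2 * r10 * r20 * r12) / (1 - r12**2))

def partial_r(r10, r20, r12):
    """Partial correlation of predictor 2 with criterion 0, predictor 1 held constant."""
    if r12 >= 1 - 1e-9:  # nothing is left of 2 once 1 is held constant
        return 0.0
    return (r20 - r10 * r12) / math.sqrt((1 - r10**2) * (1 - r12**2))

r10, r20, r12 = 0.40, 0.40, 0.60
results = {}
for label, reliability in [("uncorrected", 1.0), ("rel .90", 0.90), ("rel .60", 0.60)]:
    corrected = disattenuate(r12, reliability, reliability)
    results[label] = (multiple_r(r10, r20, corrected), partial_r(r10, r20, corrected))
    print(label, round(results[label][0], 2), round(results[label][1], 2))
```

Only the inter-predictor correlation is corrected in this sketch; the validity values are left as observed.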


Exemplary Instances of Incremental Validity Research 


Demonstrations of incremental validity are not common in re- 
search literature except in prediction of academic performance. Un- 
fortunately, where they occur the data often are discouraging. Winch 
and More (1956) used a multiple correlation technique in an at- 
tempt to determine the increment produced by TAT protocols over 
a semi-structured interview and case history material. Their results 
provide no basis for concluding that the TAT contributes anything 
beyond what is given by interviews or case histories. Sines (1959) 
discovered that the Rorschach apparently did yield better than 
chance predictions, but it seemingly not only did not add to other 
information obtained from interviews and a biographical data sheet, 
but it actually produced a net decrement in predictive accuracy.
This in spite of better than chance “validity.” Kostlan (1954) found
that judges made better than chance inferences about patients’ behavior
from only “minimal data” (age, occupation, education,
marital status and source of referral). When test results were used
to make the same judgments, only the social history yielded more
accurate inferences than those made from simple biographical facts.

In the general area of prediction of academic success, data are
widely available indicating the increment over previous grades afforded
by predictions based on psychometric data. Even in predicting
academic performance, however, it is not always clear that the
use of test data accomplishes anything beyond increasing the reliability
of the ability measure based on grades. If treated as suggested
above, it might be possible to determine whether a test contributes
anything beyond maximizing the reliability of the general
ability measure afforded by grades. Ford (1950) has presented data
concerning the prediction of grades in nursing school making use of,
among other measures, the Cooperative General Science Test
(CGST) and high school point average (HSPA). The correlation
matrix between these variables is:


2 It is to be noted that correction for attenuation of the validity values is
not suggested and should not be done.


                  CGST    HSPA    Grades
   1. CGST                 .33     .57
   2. HSPA                         .51
   0. Grades


The multiple correlation $R_{0.12}$ is .66 and the partial $r_{01.2}$ is .50. The
split-half reliability of the CGST has been reported to be .88. While
no reliability estimate for HSPA is known to the writer, several researchers
have reported reliabilities for college grades (Anderson,
1953; Bendig, 1953; Wallace, 1951). If we take the median value of
the three reported values of .78, .80, and .90 as a likely estimate for
HSPA and then correct the $r_{12}$ for attenuation, the .33 becomes .40.
The multiple correlation then drops only to .65 and the partial correlation
only to .46. It is obvious that for the prediction of grades in
nursing courses the use of the Cooperative General Science Test results
in an increment in validity over high school grades and that
the increment may be regarded as more than a contribution to reliable
measurement of a single factor.
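
The figures for the Ford data can be reproduced approximately with the sketch below. It assumes that the partial correlation reported is that of the CGST with grades, HSPA held constant, which is the reading that best matches the printed values; the function name is hypothetical. The computed values agree with those in the text to within about .01.

```python
import math

def multiple_and_partial(r_test, r_base, r_between):
    """Return (multiple R, partial r of the test with the criterion, base held constant)."""
    r_squared = (r_test**2 + r_base**2 - 2 * r_test * r_base * r_between) / (1 - r_between**2)
    partial = (r_test - r_base * r_between) / math.sqrt((1 - r_base**2) * (1 - r_between**2))
    return math.sqrt(r_squared), partial

r_cgst, r_hspa, r_between = 0.57, 0.51, 0.33
R_raw, partial_raw = multiple_and_partial(r_cgst, r_hspa, r_between)

# The median of .78, .80 and .90 stands in for the unknown HSPA reliability.
hspa_reliability = sorted([0.78, 0.80, 0.90])[1]
r_corrected = r_between / math.sqrt(0.88 * hspa_reliability)  # correction for attenuation
R_corr, partial_corr = multiple_and_partial(r_cgst, r_hspa, r_corrected)

print(round(R_raw, 3), round(partial_raw, 3))
print(round(r_corrected, 3), round(R_corr, 3), round(partial_corr, 3))
```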


Summary 
It is proposed that in addition to demonstrating the convergent
and discriminant validity of tests intended for use in clinical situations,
evidence should be produced for incremental validity. It must
be demonstrable that the addition of a test will produce better predictions
than are made on the basis of information other than the
test ordinarily available. Reference to published research indicates
that situations may well occur in which, in spite of better than
chance validity, tests may not contribute to, or may even detract
from, predictions made on the basis of biographical and interview
information. It is further suggested that, when correlations for a
given test are entered into a multiple correlation or partial correlation,
the inter-predictor correlations be corrected for attenuation
to determine whether an increase in the multiple or partial
correlations is to be attributed to a mere increase in reliability of
measurement of the predictor variable.


REFERENCES

American Council on Education: The Cooperative Test Service. “A
Booklet on Norms.” New York: 1938.

American Psychological Association, Committee on Psychological
Tests. Technical Recommendations for Psychological Tests and
Diagnostic Techniques. Washington, D. C.: APA, 1954. (Reprinted
from: Psychological Bulletin Supplement, LI (1954),
619-629.)

Anderson, Scarvia B. “Estimating Grade Reliability.” Journal of
Applied Psychology, XXXVII (1953), 461-464.

Bechtoldt, H. P. “Construct Validity: A Critique.” American Psychologist,
XIV (1959), 619-629.

Bendig, A. W. “The Reliability of Letter Grades.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, XIII (1953), 311-321.

Campbell, D. T. “Recommendations for APA Test Standards Regarding
Construct, Trait, or Discriminant Validity.” American
Psychologist, XV (1960), 546-553.

Campbell, D. T. and Fiske, D. W. “Convergent and Discriminant
Validation by the Multitrait-Multimethod Matrix.” Psychological
Bulletin, LVI (1959), 81-105.

Conrad, H. “Information Which Should Be Provided by Test Publishers
and Testing Agencies on the Validity and Use of Their
Tests.” In Proceedings, 1949, Invitational Conference on Testing
Problems. Princeton, N. J.: Educational Testing Service, 1950.

Cronbach, L. J. and Gleser, Goldine C. Psychological Tests and Personnel
Decisions. Urbana: University of Illinois Press, 1957.

Cronbach, L. J. and Meehl, P. E. “Construct Validity in Psychological
Tests.” Psychological Bulletin, LII (1955), 281-302.

Ford, A. H. “Prediction of Academic Success in Three Schools of
Nursing.” Journal of Applied Psychology, XXXIV (1950), 186-

Jessor, R. and Hammond, K. R. “Construct Validity and the Taylor
Anxiety Scale.” Psychological Bulletin, LIV (1957), 161-170.

Kelley, T. L. Interpretation of Educational Measurements. New
York: World Book Company, 1927.

Kostlan, A. “A Method for the Empirical Study of Psychodiagnosis.”
Journal of Consulting Psychology, XVIII (1954), 83-88.

Loevinger, Jane. “Objective Tests as Instruments of Psychological
Theory.” Psychological Reports, III (1957), 635-694. Monograph
Supplement 9.

Shaffer, L. “Information Which Should Be Provided by Test Publishers
and Testing Agencies on the Validity and Use of Their
Tests. Personality Tests.” In Proceedings, 1949, Invitational
Conference on Testing Problems. Princeton, N. J.: Educational
Testing Service, 1950.

Sines, L. K. “The Relative Contribution of Four Kinds of Data to
Accuracy in Personality Assessment.” Journal of Consulting
Psychology, XXIII (1959), 483-492.

Wallace, W. L. “The Prediction of Grades in Specific College
Courses.” Journal of Educational Research, XLIV (1951), 58

Winch, R. F. and More, D. M. “Does TAT Add Information to Interviews?
Statistical Analysis of the Increments.” Journal of
Clinical Psychology, XII (1956), 316-321.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


HIGH SCHOOL PERFORMANCE OF UNDERAGE PUPILS 
INITIALLY ADMITTED TO KINDERGARTEN ON THE 
BASIS OF PHYSICAL AND PSYCHOLOGICAL 
EXAMINATIONS 


JAMES R. HOBSON 
Brookline (Massachusetts) Public Schools


Introduction 


Logically, if we are to provide for individual differences after
a child enters school, it seems reasonable to recognize some of the
more basic and obvious differences as he approaches school age and
to develop an elastic system of school admission based upon those
differences which are objectively measurable and which do not, in
the main, depend upon environment and training. In fact, early admission
may be the ideal method of acceleration.

The Brookline Plan of Underage Admission. For the past thirty-five
years, the Public Schools of Brookline have admitted to kindergarten
all educable children who have attained a minimum chronological
age of four years nine months as of October 1. For the first
fifteen of these years, children from three to nine months or more
below this age were admitted on trial following an individual psychological
examination by the Department of Child Placement and
a physical and health examination administered by the Medical
Mer. Approximately 115 children were admitted annually under 
this plan and individual records of the later school performance of
the children so admitted have been carefully kept. The results of
early research undertaken to appraise the validity of the criteria
used showed:


1) A significantly high positive relationship between mental age
at entrance and both teachers’ marks and standardized achievement
test results through grade four.

2) Average marks and achievement test results of the underage
group higher than those of the other children in every grade except
kindergarten, where the marks of the underage children were
slightly lower on the average despite higher average ratings on
standardized reading readiness tests.


Later Studies. The results (Hobson, 1948) of ten years’ operation of 
this elastic system of admission confirmed both conclusions of the 
earlier study and further indicated: 


1) that the margin of average superiority of the selected under- 
age children increased as they progressed through the eight grades 
of elementary school; 

2) that the least successful group of underage children were 
those admitted with an M.A. rating of 5-0, which was the mini- 
mum requirement; 

3) that the next least successful group was composed of those 
children more than six months underage, although some individ- 
uals in this group were very successful; 

4) that underage children originally admitted by test not only
exceeded their fellows scholastically on the average but were referred
less often for emotional, social, and other personality maladjustments;
and

5) that, because of lower ages for school admission in large communities
nearby and the frequent changes of residence in a metropolitan
area, by grade six there were more underage children who
had moved in than there were in the group originally admitted by
test. By grade eight there were nearly twice as many, and nearly
half of the group admitted by test had departed.

As a result of this research, the minimum M.A. required for admission
on trial was raised to 5-2 and the privilege of early admission was
limited to those within six months of the required minimum chronological
age for all children. This plan has been followed for several years.


The Current Investigation 
The three main purposes of the present study are: 
1) to compare high school scholastic performance of underage
children, originally admitted to kindergarten by test (ABT), with
performance of the others in their class;

2) to compare high school activity participation by underage,
test-screened children and their classmates; and

3) to gain some idea as to the relative success in college admis- 
sions of the two groups. 


Two general investigations were made. The first was a comparison
of distinctions received by the 550 underage, test-screened
graduates in ten classes and the 3,891 other pupils in those classes.
The distinctions chosen were: 1) graduation with honors (which de- 
notes an all A and B record in grades 11 and 12) and 2) election to 
Alpha Pi, an undergraduate honorary society for which the criteria 
of selection are participation and prominence in extra-curricular 
activities as well as excellence in scholarship (the latter alone does 
not suffice). The second study was a more detailed analysis of 
two early classes whose scholastic performance through elementary 
school was reported in detail in an earlier article (Hobson, 1948). 
Analyses are made of scholastic performance in high school, partici- 
pation in extra-curricular activity, and data on college admissions. 


Comparison of Distinctions Received 


A complete summary of the comparative distinctions received by
the underage graduates who were originally admitted by test and
the other graduates over a ten-year period is presented in Table 1.
It should be noted that there is an important factor which has some
bearing on the performance of underage children as compared with
those of their older classmates. Because of lower entrance ages in

TABLE 1

Percentages of Academic and Extra-Curricular Distinctions of Underage,
Test-Screened Pupils and Others in Ten Graduating Classes Combined

                                    Honor Graduates        Elected to Alpha Pi
                          Total N   Boys  Girls  Total     Boys  Girls  Total
Underage, test-screened      550    18.8  25.4   22.7      12.9  18.7   16.4
Others                     3,891     8.4  15.0   ALOT      8 8,7. "58.
Difference                          10.4  10.4   10.8      FL ООБА


* 

in the Lore were 224 boys and 326 ted (of 1,165 originally admitted to kindergarten 
[ow Underage, test-screened e ed s made up of 1,863 boys and 2, 




some cities near Brookline, we have by grade six as many underage 
children who have moved into town after grade one as we have who 
were originally admitted by test to kindergarten, and by grade nine 
we have twice as many. The study reported in 1948 showed that 
these children as a group were in every grade (2 to 8 inclusive) more 
successful scholastically than the older children in the class but less 
successful than the underage children originally selected by test. In 
this study these children are in every case included in the “other” 
classmates with whom the underage accelerates are compared. 

The data in Table 1 show that the underage boys and girls ex- 
ceeded their older fellows in the percentage who graduated with 
honor by a margin statistically significant beyond the one per cent 
level of confidence. 
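
The text does not state which test of significance was used. One common choice for comparing two independent percentages is the pooled two-proportion z test, sketched below with the honor-graduate percentages for boys and girls from Table 1. The number of other girls is taken here as 3,891 less 1,863, which is an assumption derived from the totals, and the function name is hypothetical.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Pooled z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

OTHER_BOYS = 1863
OTHER_GIRLS = 3891 - OTHER_BOYS  # assumed: total of other pupils less the boys

# Honor graduates: underage, test-screened pupils versus the others.
z_boys = two_proportion_z(0.188, 224, 0.084, OTHER_BOYS)
z_girls = two_proportion_z(0.254, 326, 0.150, OTHER_GIRLS)
Z_ONE_PER_CENT = 2.576  # two-tailed unit normal value for the one per cent level

print(round(z_boys, 2), round(z_girls, 2))
```

Under these assumptions both values fall well beyond the one per cent point, which is consistent with the statement in the text.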

For election to Alpha Pi, at least one-third of the points presented 
as evidence of eligibility for membership must come from participa- 
tion in extra-curricular activities. The data summarized in Table 1 
show that the percentage of underage boys and girls gaining election 
to Alpha Pi exceeded that of the other boys and girls by a substantial 
margin, again yielding a difference significant beyond the one per
cent level of confidence. 

These data indicate that underage children originally admitted to 
kindergarten on the basis of psychological and physical examina- 
tions are certainly not at a disadvantage during their high school 
years so far as honors and distinctions at graduation are concerned. 
Both criteria indicate the previously reported superiority of the experimental
group holds throughout the public school years of these
pupils.


Comparative Scholastic Performance of Underage and Other 
Graduates in Two Classes 


Since the elementary school records of the classes which graduated 
in 1946 and 1947 had been analyzed in some detail in the 1948 study, 
it seemed fitting to choose these two classes as the ones to be studied
in detail through their high school years. The criterion of academic 
success was GPA in the 16 standard high school courses. Table 2 
presents these analyses. 

In order to present the comparative scholastic performance of the
underage and other graduates by means of a single index, marks 
were translated into a point rating scale used in many secondary 




TABLE 2
Four-Year Academic Point Ratings of Underage, Test-Screened
Graduates and Others in Two Classes

                      Class of 1946                    Class of 1947
              Total                            Total
                N    Boys   Girls   Total        N    Boys   Girls   Total
Underage       39    2.43   2.52    2.49        52    2.53   2.73    2.64
Others        386    2.22   2.33    2.29       388    2.18   2.38    2.28
Difference            .21    .19     .20               .35    .35     .36

* The class of 1946 was made up of 12 boys and 27 girls in the experimental group; 139 boys
and 197 girls in the control group. In the class of 1947 there were 20 boys and 32 girls in the former,
197 boys and 191 girls in the latter groups.


schools and colleges in which an “A” counts 4 points, a “B” 3, a “C”
2, and a “D” 1 point. The differences in the Class of 1947 are
all significant at about the one per cent level of confidence and beyond.
Those in the Class of 1946 will permit rejection of the null
hypothesis at somewhere between the 5 per cent to 10 per cent level
of confidence, chiefly because of the small number of underage boys
in the Class of 1946. If the comparison is based upon the separate
course marks received by the two groups during the four high-school
years, the differences are all significant beyond the one per cent
level of confidence.
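
The point rating described above can be sketched as follows. The marks are invented for illustration, the function name is hypothetical, and only the four marks named in the text are given point values; how a failing mark was counted is not stated and is left out.

```python
# Point values named in the text: A = 4, B = 3, C = 2, D = 1.
POINTS = {"A": 4, "B": 3, "C": 2, "D": 1}

def point_rating(marks):
    """Mean point value over a pupil's course marks."""
    return sum(POINTS[mark] for mark in marks) / len(marks)

# Invented record of 16 course marks for one hypothetical graduate.
marks = ["A", "B", "B", "C"] * 4
rating = point_rating(marks)

print(len(marks), rating)
```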


Extra-Curricular Activity Participation of Underage and Other
Graduates in Two Classes


As stated previously, the second main purpose of the present
study was to investigate and informally analyze the comparative
high school extra-curricular activity participation of two groups.
Since the activity record of each graduate is published with his picture
in the yearbook and since this record is checked by the yearbook
staff and reviewed by the faculty sponsors, the data published
may be considered more than ordinarily reliable. The time and opportunity
for participation in extra-curricular activities may be
limited by out-of-school hours employment. Such employment is
listed and counted as an extra-curricular activity.

The data on extra-curricular activity participation of the Classes
of 1946 and 1947 are shown in Table 3.

The data in Table 3 show that the underage boys and girls of both
classes exceed their classmates of the same sex by a substantial

TABLE 3
Extra-Curricular Activity Participation of Underage (ABT)* and
Other Graduates in the Classes of 1946 and 1947

                              Underage                          Other
                        Boys           Girls            Boys           Girls
                      Number of      Number of        Number of      Number of
                      Activities     Activities       Activities     Activities
Year                 Total   Mean   Total   Mean     Total   Mean   Total   Mean
Class of 1946          138  11.58     583  19.33      1145   8.79    3023  14.85
Class of 1947          222  11.50     826  26.22      1841   8.55    3228  15.62
Both Years Combined    360  11.53    1409  22.89      2986   8.69    6251  15.24
S.D.                         7.43          14.13             7.83          11.33
Diff.                        2.84           7.65
Both Sexes

*ABT = Admitted by test.


margin in average number of extra-curricular activities engaged in 
over the four-year period. Both the underage and other girls ex- 
ceeded both underage and other boys both years as well. The most 
significant fact for the purpose of this study was the revelation that 
the underage boys and girls of both classes taken as one group had 
an average of 18.8 extra-curricular activities compared to an average 
of 12.1 activities per student among the other graduates in the two 
classes—a ratio of more than three to two. The underage boys exceeded
the other boys by a difference that is significant at about the
2 per cent level of confidence. The wide margin by which underage 
girls exceeded the other girls is significant far beyond the 1 per cent 
level of confidence. 
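
The combined averages quoted above can be checked roughly from the activity totals in Table 3. The sketch below assumes that the group sizes are the two-year enrollments reported in Table 4 (32 underage boys, 62 underage girls, 356 other boys, 408 other girls); pairing the two tables in this way is an assumption made here for illustration and is not stated in the report.

```python
# Rough check of the pooled means, using activity totals from Table 3.
# ASSUMPTION: group sizes are taken from the enrollment columns of Table 4.

def combined_mean(totals, sizes):
    """Mean number of activities per graduate when several groups are pooled."""
    return sum(totals) / sum(sizes)

# Underage (ABT) boys and girls, both classes combined.
underage = combined_mean(totals=[360, 1409], sizes=[32, 62])
# Other boys and girls, both classes combined.
other = combined_mean(totals=[2986, 6251], sizes=[356, 408])

print(f"underage (ABT): {underage:.1f} activities per graduate")
print(f"other graduates: {other:.1f} activities per graduate")
print(f"ratio: {underage / other:.2f}")
```

Under that assumption the pooled means agree with the figures given in the text, and the ratio is a little above three to two.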

Apparently the accelerated status and youth of the underage 
children originally admitted by test was no handicap to them in 
extra-curricular activity participation. Judging from the data on
election to Alpha Pi, it appeared that they were able to achieve
more than their share of success and prominence in these activities. 

A detailed analysis of the kinds of extra-curricular activities engaged
in by the underage and other graduates would make an interesting
study in itself, but that is too lengthy an undertaking to be
included in this paper. An informal analysis of the activities engaged
in by the two groups of graduates shows no discernible differences
in the kinds of activities undertaken by the underage and
other girls. While 126 of the 224 underage boys took part in athletics,
it appears from a subjective analysis of all ten classes of graduates
that the underage boys seldom achieved eminence in the contact
sports.

In any event the data in Tables 1 and 3 will support the generalization
that the underage boys and girls were more universally active
and successful in extra-curricular activity participation and
that a larger share of them achieved distinction in these activities
than was true of their classmates.


Post-Secondary School Admissions of Underage Graduates 
(ABT) in the Classes of 1946 and 1947 Compared with 
Other Graduates 


The data in regard to college admissions for the classes of 1946 
and 1947 are shown in Table 4, broken down into regular four-year 
colleges which are fully accredited and other advanced institutions 
which include junior colleges, business and other specialized schools, 
and some of the newer colleges which have not as yet qualified for 
full accreditation. 

The data in Table 4 show that a significantly larger percentage 
of underage boys and girls went on to post-secondary education. If 
only four-year accredited colleges are considered, the margin is
even greater: 22.6% more ABT boys and 21% more ABT girls went


TABLE 4
Admission to Post-Secondary Schools of Underage (ABT) and
Other Graduates of the Classes of 1946 and 1947

BOYS

               Underage (ABT)                      Other
               4-Year       Other                  4-Year       Other
               Colleges     Inst.                  Colleges     Inst.
Class    En.    N     %      N     %       En.     N     %       N     %
1946*     12     6   50.0     1    8.3     140     36   25.7     26   [illegible]
1947      20    13   65.0     2   10.0     216     95   44.0     26   12.0
Total     32    19   59.4     3    9.4     356    131   36.8     52   [illegible]

* 4 ABT boys and [number illegible] other boys of the Class of 1946 entered the
Armed Forces of the U.S. immediately after graduation. These percentages are
33.3 and 32.1, respectively.

GIRLS

               Underage (ABT)                      Other
               4-Year       Other                  4-Year       Other
               Colleges     Inst.                  Colleges     Inst.
Class    En.    N     %      N     %       En.     N     %       N     %
1946*     30    15   50.0     6   20.0     205     66   32.2     68   33.2
1947      32    19   59.4    10   31.3     203     72   35.5     59   29.1
Total     62    34   54.8    16   25.8     408    138   33.8    127   31.1
Diff.                21.0


on to such colleges than was true of their fellows. It may also be
noted that, despite the entrance into the armed forces of about one-third
of both the ABT and other boys in the Class of 1946, for the
two-year period the boys in each category exceeded the girls in
percentage gaining admission to four-year colleges. If 1947 is taken
as a more normal year, the margin is sizable. This is undoubtedly a
reflection of the greater tendency of boys to prepare for entrance
into the professions and the greater tendency for girls to prepare for
business, homemaking, and shorter term occupational specialties.

The percentage of both ABT and other girls substantially exceeded
that of boys in their category in the matter of total post-
secondary school attendance. However, the ABT boys exceeded the 
older girls for the two-year period despite the loss to the armed 
forces in 1946. The data presented in Table 4 will amply support the 
generalization that a larger percentage of underage accelerates in 
the Classes of 1946 and 1947 gained admission to first-class colleges 
for the purpose of continuing their education than was true of their 
fellows. 
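
The percentage margins cited from Table 4 can be reproduced from the two-year totals. The sketch below also computes a pooled two-proportion z statistic as one conventional way of gauging such a difference; the report does not say which significance test was applied, so the choice of test and the function names are assumptions made here for illustration.

```python
import math

def pct(n, enrolled):
    """Percentage of an enrollment admitted to four-year colleges."""
    return 100.0 * n / enrolled

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic (illustrative; not the report's stated test)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# (admitted to four-year colleges, enrollment): two-year totals from Table 4.
boys_abt, boys_other = (19, 32), (131, 356)
girls_abt, girls_other = (34, 62), (138, 408)

boys_diff = pct(*boys_abt) - pct(*boys_other)
girls_diff = pct(*girls_abt) - pct(*girls_other)
boys_z = two_proportion_z(*boys_abt, *boys_other)
girls_z = two_proportion_z(*girls_abt, *girls_other)

print(f"boys:  margin {boys_diff:.1f} percentage points, z = {boys_z:.2f}")
print(f"girls: margin {girls_diff:.1f} percentage points, z = {girls_z:.2f}")
```

The margins agree with the 22.6 and 21.0 reported above; the z values are offered only as a plausibility check under the stated assumption.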


A Necessary Major Assumption 


This has been an objective report of factual material recorded on 
permanent records in the archives of Brookline High School. In arriving
at the conclusions summarized below it has been necessary to
make only one major assumption, namely, that no selective factor 
affecting the results of this research is involved in the moving away 
from Brookline of more than half of the underage children originally 
admitted by test before their graduation. The data in the 1[...]
study showed an average of 117 underage pupils per year admitted
to kindergarten by tests and 115 per year completing grade one. The
question is—are the talents and other personality traits of the 55
per year who remained to graduate approximately the same on the
average as those of the 60 who have departed? Brookline is an expensive
town in which to own property. Consequently, a great deal
of the moving is occasioned by successful people in business or the
professions buying homes in Newton, Needham, or Wellesley,
Massachusetts, and moving from their rented apartments in Brookline.
The children of successful people moving to homes of their own
should compare favorably in the traits that make for success in
school as compared to the children of families which stay put. While
subjectively one might think that families whose children are particularly
happy and successful in school might tend to avoid moving [...]


approximately 60% of what it had been during the last four years of
elementary school. The Class of 1946 lost twelve underage accelerates
in grades 5 to 8 inclusive and fourteen in grades 9 to 12 inclusive,
while in the Class of 1947 the figures were twenty-five and eight,
respectively.


Conclusions 


The following conclusions appear to be supported by the data de- 
tailed in the preceding sections of this report. 


1. The scholastic superiority in elementary school of underage children,
originally admitted to school on the basis of physical and
psychological examinations, is continued and somewhat increased
through high school. This conclusion is supported by the statistically
significant margin by which both boys and girls in the
underage (ABT) groups achieved higher GPA's and by the percentage
graduated with honor.

2. Underage accelerates (ABT) engaged in a significantly larger
average number of extra-curricular activities over the four-year
period. Their activity participation was not overly weighted with
activities of a scholastic nature. Athletic and social honors and
elective positions came in for their full share of underage participation.



3. In the matter of honors, awards, and distinctions at graduation 
the underage (ABT) boys and girls exceeded their fellows by a 
ratio of about two to one. 

4. A significantly larger percentage of underage (ABT) graduates 
sought and gained admission to accredited four-year colleges of 
superior standing than was true of their classmates of the same 
sex. 

5. Initial acceleration, of children who are within a few months of 
the usual minimum age for admission to school and who can 
demonstrate, in physical and psychological examinations, physical 
fitness and mental maturity which will insure their being under 
no serious initial handicap as compared with the average of their 
older classmates, is the ideal means of making initial provision 
for individual differences. It avoids the break in the continuity 
of the educative process which is inherent in any system of grade 
skipping or double promotion after a child has attended school. 
It is a step in providing for gifted children and saves a year for
the candidate for the professions who faces a long period of gradu- 
ate and post-graduate study after college. 


REFERENCES 
Ammons, M. P. and Goodlad, J. I. “When to Begin: Dimensions of 
the First Grade Entrance Age Problem." Childhood Education 
XXXII (1955), 21-26. 
Anderson, E. E. (Editor) Research on the Academically Talented
Student. Washington, D. C.: National Education Association,
1961.

Beall, Ross H. and Holmes, Mossie. “Identifying Mature and Immature
First-Year Entrants.” Newer Practices in Reading in the
Elementary School. Bulletin of the Department of Elementary
School Principals, National Education Association, July 1938,

Birch, J. W. “Early School Admission for Mentally Advanced Children.”
Exceptional Children, XXI (1954), 84-87.

Carter, L. “The Effect of Early School Entrance on the Scholastic
Achievement of Elementary School Children in the Austin Public
Schools.” Journal of Educational Research, L (1956), 91-108.

Cone, Herbert B. “Brookline Admits Them Early.” The Nation's
Schools, LV (3), (1955), 46-47.

Dwyer, P. S. “Correlation Between Age at Entrance and Success in
College.” Journal of Educational Psychology, XXX (1939),

Edmiston, R. W. and Holohan, C. E. “Measures Predictive of First-Grade
Achievement.” School and Society, LXIII (1946),




Forester, J. J. “At What Age Should A Child Start School?” School
Executive, LXXIV (1955), 80-81.

Gates, Arthur I. “The Necessary Mental Age for Beginning Reading.”
Elementary School Journal, XXXVII (1937), 497-498.

lles, Herbert M. and Coulson, Marion C. “At What Age Is A
Child Ready for School?” School Executive, LXXVIII (1959),
29-31.

Gray, William S. (Editor) Reading in General Education. Washington,
D. C.: American Council on Education, 1940.

Handy, A. E. “Admission of Underage Pupils.” American School
Board Journal, LXXXIII (2) (1931), 46.

Handy, A. E. “Are Underage Children Successes in School?” American
School Board Journal, XCVII (4) (1938), 31-32.

Harrison, M. Lucile. Reading Readiness. Boston: Houghton Mifflin,

Hausman, E. J. “Ready for First Grade?” School Executive, LIX
(2), (1940), 25-26.

Hildreth, G. H. “Age Standards for First Grade Entrance.” Childhood
Education, XXIII (1946), 22-27.

Hobson, James R. “Mental Age As A Workable Criterion for School
Admission.” Elementary School Journal, XLVIII (1948), 312-

Kazienko, L. W. “Beginner Grade Influence on School Progress.”
Educational Administration and Supervision, XL (1954), 219-

Keys, N. Underage Student in High School and College. Berkeley:
University of California Press, 1938.

King, Inez B. “Effect of Age of Entrance into Grade One Upon
Achievement in Elementary School.” Elementary School Journal,
LV (1955), 331-336.

Knight, J. and Manuel, H. T. “Age of School Entrance and Subsequent
School Record.” School and Society, XXXII (1930), 24-26.

Lehman, H. C. Age and Achievement. Princeton: Princeton University
Press, 1953.

Manwiller, C. E. “Follow-up of Pupils Tested for Placement in
Grade One Before the Chronological Age of Six.” Pittsburgh
Schools, No. 10 (1936), 86-95.

Miller, Vera V. “Academic Achievement and Social Adjustment of
Children Young for Their Grade Placement.” Elementary School
Journal, LVII (1957), 257-263.

Monderer, J. H. “An Evaluation of the Nebraska Program of Early
Entrance to Elementary School.” Unpublished Ph.D. Thesis,
University of Nebraska, 1953. Abstract: Dissertation Abstracts
XIV:633 (1954, Part 1).

Nemzek, C. L. and Finch, F. H. “Relationship Between Age at Entrance
to Elementary School and Achievement in the Secondary
School.” School and Society, XLIX (1939), 778-779.

Partington, H. M. “Relation Between First-Grade Entrance Age
and Success in the First Six Grades.” National Elementary Principal,
XVI (1937), 298-302.




Patterson, H. “Chronological Age of Highly Intelligent Freshmen.”
Peabody Journal of Education, XII (1934), 19-20.

Pauly, Frank B. “Sex Differences and Legal Schools Entrance Age.”
Journal of Educational Research, XLV (1951), 1-9.

Shannon, Dan C. “What Research Says About Acceleration.” Phi
Delta Kappan, XXII (1957), 70-72.

Smith, J. “The Success of Some Young Children in the Lincoln,
Nebraska Public Schools.” Unpublished Master's thesis, University
of Nebraska, 1951.

Terman, L. M. “The Discovery and Encouragement of Exceptional
Talent.” American Psychologist, IX (1954), 221-230.

Washburne, Carleton. Introduction to Child Development and the
Curriculum, Part 1. Bloomington, Ill., Public School Publishing
Company, 1939.

Wilson, F. T. “Educators’ Opinions About the Acceleration of Gifted
Students.” School and Society, LXXX (1954), 120-122.

Worcester, D. A. The Education of Children of Above Average 
Mentality. Lincoln: University of Nebraska Press, 1956. 

Wright, Grace S. “Permissive School Entrance Ages in Local School 
Systems.” School Life, XVIII (1946) , 20. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


RATER PERSEVERATION IN MEASUREMENT 
OF PATIENT CHANGE¹


MARY HELEN MICHAUX, KAY Y. OTA, THOMAS E. HANLON, 
AND ALBERT A. KURLAND 


Spring Grove State Hospital²


Studies aimed at evaluating change in psychiatric patients undergoing
treatment have shown similar trends in degree of pathology
assessed over a series of evaluations. Investigations of drug therapy
(Kurland, et al., 1960) and of group and individual psychotherapy
(Frank, et al., 1959) show similar patterns of patient change, as
measured by criterion instruments. In each case there was an initial
drop in pathology, between pretreatment and first posttreatment
evaluations, followed by relatively little change over subsequent
rating intervals. Meehl (1959) observed that therapists’ successive
Q-sort evaluations of their patients stabilized very rapidly. The
therapists Q-described their patients after the first therapeutic hour,
again after the second, fourth, eighth, sixteenth, and twenty-fourth
contact. By the end of the second or fourth hour, the correlation
coefficients with subsequent hours approached sort-resort reliabilities.
Meehl commented that “the extent to which this rapid convergence
to a stable perception represents invalid premature ‘freezing’ is unknown.”
Dailey’s (1952) study was cited as pertinent. Dailey found
that clinicians tended to form stereotypes of patients on the basis
of partial information which resisted change, even though further


¹ This paper is based on one part of the results of Research Project Grant
No. MY 4239 of the National Advisory Mental Health Council, U. S. Public
Health Service, administered by Friends of Psychiatric Research, Inc., Spring
Grove State Hospital, Baltimore 28, Maryland.

² The authors wish to express their gratitude to Dr. Bruno Radauskas,
Superintendent, for making available facilities of the hospital for carrying out
this research.




information contained contradictory evidence. Sines (1959) found 
that “clinicians changed their impressions of a patient less and less 
with further increments to the available data.” He felt that the
rapid crystallization of clinicians’ perceptions of patients suggested 
that they were interpreting their early inferences as facts. 

For purposes of discussion, the pattern of measured change de- 
scribed above will be termed the “plateau” effect. Since this phe- 
nomenon appears to be independent of criterion instrument, type of 
treatment, or time interval between measurements, an artifact 
should be suspected. It was the general purpose of this research to 
determine when and to what extent “pigeonholing” of chemotherapy 
patients by clinicians occurs early in a series of evaluative inter- 
views, in such a way that further changes under treatment are not 
perceived. The general hypothesis to be studied may be stated as 
follows: The “plateau” effect observed in studies of patient change 
under treatment is in part a function of perseveration of rater opinion
based on previous contacts with the patient. From this hypothesis,
it was predicted that repeated interview observations of
chemotherapy patients by the same observer would at some point 
fail to register fully further favorable change. This “hidden” change 
could, we predicted, be detected by comparing, for each interview 
in the series, behavioral scale ratings of patients’ degree of illness 
made by a continuing observer with those of an observer who was 
seeing the same patients for the first time. 


Method 
Subjects 


The subjects were 90 state hospital patients referred for treatment 
with one of eight phenothiazine compounds: triflupromazine, chlor- 
promazine, trifluoperazine, prochlorperazine, perphenazine, fluphen- 
azine, thiopropazate, and thioridazine. Criteria of selection specified 
newly admitted patients between the ages of 18 and 60 years. Alcoholics,
court orders, chronic brain syndromes, major organic diseases,
drug addicts, and seniles were excluded. The term alcoholic was applied
to patients with acute brain syndromes and those who, in the
opinion of their doctors, showed primary rather than symptomatic 
alcoholism as a presenting problem. The median age of the subjects 
was 36 years, the range from 18 to 59 years. The ratio of men to 
women was two to three (36 males, 54 females). 


Measures


The Multidimensional Scale for Rating Psychiatric Patients
(Lorr, 1953) was one of two devices used to evaluate patient change.
This instrument had previously proved effective in measuring differential
effects among six phenothiazine compounds and between active
drugs and positive and inert placebos in short-term treatment
of acutely ill psychiatric patients (Kurland, et al., 1961). A revised
form of the MSRPP, the Inpatient Multidimensional Psychiatric
Scale (Lorr & McNair, 1960), was used in connection with the
MSRPP. In the interest of clarity, only the MSRPP total morbidity
scores of the interview section will be dealt with in this report.


Procedure 


The 90 chemotherapy patients were assigned randomly to three
groups of 30 patients each. They were interviewed and rated on
criterion instruments prior to treatment and on the fifth, fifteenth,
and thirtieth days of medication, as indicated in Table 1.
Two raters participated in each interview: a “constant” rater
who saw the patient for all four interviews and an “alternate” rater,
whose schedule of contacts with patients is indicated in Table 1.
A1 refers to the alternate who saw the patient initially and A2 to his
replacement, considered a fresh rater at point of replacement. In
Groups I, II, and III, alternate rater A1 was replaced by alternate
rater A2 at the second, third, and fourth interviews, respectively.
Evaluations were made on the basis of a one-hour interview with
the patient in which the two raters participated jointly.


TABLE 1
Rating Schedule

                        Occasion
             1          2          3          4
Group     Initial     Day 5     Day 15     Day 30

I           A1         A2         A2         A2
            C          C          C          C
II          A1         A1         A2         A2
            C          C          C          C
III         A1         A1         A1         A2
            C          C          C          C

A1 = beginning alternate rater; A2 = second alternate rater; C = continuing or constant
rater.




Three clinical psychologists were responsible for all ratings. The
average intraclass correlation for the MSRPP total morbidity score 
(interview section) for the three psychologists had previously been 
found to be .80 for a sample of 20 patients. Prior to criterion data 
collection, the three participating raters saw 10 patients together. 
Each patient was discussed jointly, and various items in assess- 
ment instruments clarified. Several “ground rules" were set forth 
in a formalized rating guide for the project which was placed in the 
hands of each rater. Typical stipulations were as follows: 


(1) Do not discuss patients at all. (2) Constant rater leads all
interviews. Alternate may question patient about specific matters
after constant rater has finished. (3) MSRPP Item 2
(bizarre or stiff posture). Rate drug-induced postures also. (Since
an inference would have been involved in attributing certain
postures to drug effects, raters agreed to rate postures as observed,
whatever opinion they might hold with respect to their
origin.) (4) MSRPP Item 21 (somatic preoccupation). Rate on
preoccupation, regardless of physical condition. (“What if the
patient is, in fact, physically ill?”) Include food faddists, physical
culture faddists, etc. (Fairly frequently patients showed preoccupation
with physique and “body building.” The question
arose, “Is this somatic preoccupation?”)


The above sample “guidelines” are given to indicate the way in
which participating raters discussed instruments until there was a
meeting of minds with respect to each item. After actual data collection
began, discussion was forbidden, either of patients or of scale
items. Pearson product moment correlations based on the first joint
interviews of the study for pairs of raters were as follows (N = 30
for each correlation): Raters X and Y, .89; raters X and Z, .89;
raters Y and Z, .77.

Since all patients were receiving active phenothiazine compounds 
in a double-blind arrangement, it was impossible to control the type 
of treatment each patient received. Ratings by a constant rater and 
one alternate rater at each evaluation point were secured in an
effort to control effects of possible differences in time required for
the various drugs to produce behavioral effects in patients. In order 
to eliminate effects of individual rater idiosyncrasies, each of the 
raters, X, Y, and Z, served as constant or alternate rater, according 




TABLE 2
Individual Rater Balance, Illustrated for Pattern I

                              Occasion and Rater
Group                    1         2         3         4
Ia    5 patients       X(C)      X(C)      X(C)      X(C)
                       Z(A1)     Y(A2)     Y(A2)     Y(A2)
Ia'   5 patients       X(C)      X(C)      X(C)      X(C)
                       Y(A1)     Z(A2)     Z(A2)     Z(A2)
Ib    5 patients       Y(C)      Y(C)      Y(C)      Y(C)
                       Z(A1)     X(A2)     X(A2)     X(A2)
Ib'   5 patients       Y(C)      Y(C)      Y(C)      Y(C)
                       X(A1)     Z(A2)     Z(A2)     Z(A2)
Ic    5 patients       Z(C)      Z(C)      Z(C)      Z(C)
                       X(A1)     Y(A2)     Y(A2)     Y(A2)
Ic'   5 patients       Z(C)      Z(C)      Z(C)      Z(C)
                       Y(A1)     X(A2)     X(A2)     X(A2)


to a pattern illustrated for Group I in Table 2. Groups II and III
were similarly balanced with respect to individual raters.

The arrangement shown in Table 2 permitted comparisons to be 
made at the second evaluation point between the ratings of X, Y, 
and Z as constant raters (having seen the patient for one previous 
interview) and their ratings as Alternate 2, or new rater. That is to 
say, the 30 patients of Group I were seen by raters X, Y, and Z 
(10 patients per rater) as continuing or “constant” rater; parallel 
alternate ratings for these patients were also divided evenly among 
the same three raters. Similar comparisons for Groups II and III
can be made at the 15th and 30th day, respectively.
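
The balancing described above can be sketched in a few lines. The code enumerates every assignment of raters X, Y, and Z to the roles of constant rater, first alternate (A1), and second alternate (A2) for Group I, where the alternate is replaced at the second interview. The function and variable names are illustrative assumptions, and the subgroups come out in a different order from Table 2.

```python
from collections import Counter
from itertools import permutations

RATERS = ("X", "Y", "Z")
PATIENTS_PER_SUBGROUP = 5

def group_one_schedule():
    """One row per subgroup: the rater filling each role at occasions 1 to 4."""
    rows = []
    for constant, first_alt, second_alt in permutations(RATERS):
        rows.append({
            "constant": [constant] * 4,
            # In Group I the alternate rater is replaced at the second interview.
            "alternate": [first_alt] + [second_alt] * 3,
        })
    return rows

schedule = group_one_schedule()

# Patients seen by each rater as constant rater and as new rater (A2's
# first contact) at occasion 2, the comparison point for Group I.
constant_load = Counter()
new_rater_load = Counter()
for row in schedule:
    constant_load[row["constant"][1]] += PATIENTS_PER_SUBGROUP
    new_rater_load[row["alternate"][1]] += PATIENTS_PER_SUBGROUP

for rater in RATERS:
    print(rater, constant_load[rater], new_rater_load[rater])
```

Each rater serves as constant rater for 10 patients and as new rater for 10 patients, which is the balance the design calls for.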

It was felt that, in order to hold the interview situation as nearly
constant as possible for the patient, two raters should see subjects
for each of the four interviews. Though interest was in comparison
of A2's first contact ratings with those of the constant rater, A1 was
introduced in a sense superfluously for the purpose of minimizing
the effect on the patient of the presence of a second person in the
interview at the point where old and new rater evaluations were to
be compared. It was impossible to control whatever changes might
occur in the patient’s behavior in response to the participation of a
new alternate rater. It was hoped to reduce such changes by accustoming
the patient to two interviewers. By the same token, A2's
continued participation (after his first contact) in the series of interviews
in Groups I and II was designed merely to keep the situation
for both patient and continuing rater as nearly constant as possible.
In fact, then, all alternate raters except A2 at his first contact
with the patient could be said to be dummies, included in the design
to minimize changes in the interview situation.


Results 


A comparison of ratings by the continuing rater and rater
A2, based on his first contact with the patient, for Groups I, II, and
III is given in Table 3. Figure 1 presents two composite pictures of
patient change in treatment, one for each set of raters, derived from
the data of Table 3 with the addition at the left of pretreatment
baseline mean ratings of all 90 patients by constant and first alternate
raters (r = .81; t = .96, not significant).

The hypothesis of the study was supported by the finding that
clinicians who had never seen a patient before rated him significantly
lower in total morbidity after 30 days of treatment than
clinicians who had seen the patient for three previous interviews
(t = 1.90, p < .05, one-tailed test). New raters rated [...]


Figure 1. Constant Raters vs. New Alternate Raters: Four Occasions
(pretreatment, Day 5, Day 15, Day 30)
Note. Pretreatment means are based on all 90 subjects. At the 5th, 15th, and
30th days of treatment, means are based on Groups I, II, and III, respectively
(N = 30 in each case).




TABLE 3
Old Raters vs. New Raters—MSRPP Interview Total Morbidity Score

          5th Day Interview      15th Day Interview     30th Day Interview
          Group I (N = 30)       Group II (N = 30)      Group III (N = 30)
Raters    Mean     r      t      Mean     r      t      Mean     r      t
Old       23.90                  22.30                  20.20
New       24.97                  22.47                  18.17
                  .80    0.85            .82    0.14            .85    1.90*

* Sig. < .05 level, one-tailed test.


did constant raters. Mean ratings of constant and alternate raters 
converged at the 15th day evaluation point, showing a difference of 
only .17 in morbidity scores (Table 3). 
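
Each comparison in Table 3 reports a correlation between the two raters' scores and a t value for the difference between their means. Because both raters scored the same patients, a t test for correlated (paired) means is the natural computation, although the report does not name the formula used. The sketch below shows that computation on HYPOTHETICAL scores for six patients; none of the numbers are data from the study.

```python
import math

def pearson_r(a, b):
    """Pearson product moment correlation between two equal-length score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def paired_t(a, b):
    """t statistic for the mean of paired differences (a minus b), df = n - 1."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    return mean_d / (sd_d / math.sqrt(n))

# HYPOTHETICAL total morbidity scores for six patients (illustration only).
constant_rater = [24, 18, 30, 12, 21, 27]
new_rater = [21, 17, 26, 12, 18, 25]

r = pearson_r(constant_rater, new_rater)
t = paired_t(constant_rater, new_rater)
print(f"r = {r:.2f}, t = {t:.2f}")
```

A positive t here means the constant rater scored the same patients as sicker than the new rater did, which is the direction of the 30th day finding.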

Initial mean total morbidity scores for the three patient groups 
were as follows (constant raters' scores given first in each case): 
Group I, 32.40, 31.17; Group II, 34.00, 33.67; Group III, 31.17, 30.63.
No significant differences were found among the initial mean total 
morbidity scores of the three groups. 

Discussion 

Results of the study indicate that clinicians’ ratings of chemo- 
therapy patients based on successive interviews reach a point in the 
series of evaluations at which they are measurably colored by previ- 
ous contacts with the subjects. After 30 days of drug treatment, at 
the fourth interview by constant raters, patients were rated sicker 
by these raters than by new alternate raters. This finding would 
seem to have methodological implications, not only for studies of 
treatment effects, but also for other types of experiments or practical
situations in which change is assessed by repeated evaluations of
subjects by a single individual. 

Any rating instrument must by necessity measure both rater and
ratee behavior. Stanley (1961) summarized a number of studies
dealing with errors of measurement due to human components in
the rater-scale metric. Among forms of rater bias that have been
rather thoroughly investigated were the following: (1) A halo effect
extending over ratings of several traits within one subject; (2) Consistent
differences among individual raters in levels of ratings; and
(3) Responses by raters on a purely subjective basis to particular
patients or particular traits. Stanley noted that sequence might be
considered a fourth effect (in addition to raters, ratees, and scales)




in successive ratings, but cited little previous work toward isolating 
the effects of this variable. The whole area of possible operation of 
bias in rater perception of change in ratee over a series of interviews 
seems to have been neglected. 

Since the focus of interest in research dealing with mental illness 
is so frequently evaluation of treatment, refinement of techniques 
for measuring patient change is of particular importance. Psychiatric
rating scales represent a great advance in the direction of quantify- 
ing human behavior for scientific study. The stability of the MSRPP 
interview morbidity scale is indicated by the degree and consistency 
of agreement of raters in this study as to severity of illness of pa- 
tients relative to each other. The MSRPP can be considered typical 
of a number of carefully constructed, well-validated clinical rating 
scales. It is possible that such instruments may not fully register 
favorable change in patients because of bias in scores introduced by 
premature freezing of clinical opinion. In the interest of more ac- 
curate assessment of treatment effects, further investigation is needed 
into the effects of previous contacts with patients on clinicians’ judg- 
ments of severity of illness. If the phenomenon of rater perseveration 
observed to a slight degree in the present research occurs generally 
in studies involving measurement of change, then designs of such 
experiments should be modified so as to avoid successive observations 
by one rater of one ratee, or appropriate statistical corrections 
should be employed. 

The present authors feel that the extent of error in ratings due to
previous rater-subject contact is greater in reality than results here
would indicate. The comparative curves of change presented in Figure 1
suggest that the difference between constant and new rater
mean evaluations may have been a function not only of rater contacts
but of another variable (unforeseen in the experimental design)
which operated to produce convergence of means of the two
sets of raters at the 15th day. That is, the difference between constant
rater and new rater mean evaluations may have been the result
of two forces which tended to cancel each other out at the third
interview: Constant raters may have started to “plateau,” as hypothesized,
and new raters may have tended to exaggerate pathology.
Several aspects of the data, each of which in its own right can be
given only slight corroborative weight, combine to suggest the latter
assumption. At the 5th day evaluation of Group I, when the “pla[...]


Figure 2. Constant vs. Alternate Raters, Group I: Four Occasions
(pretreatment, Day 5, Day 15, Day 30; A2 alternate raters shown at first
and second contact)


[...]s saw patients as somewhat sicker than did constant raters.
Figure 2, showing constant and alternate ratings of Group I patients
at each of the four evaluations, illustrates a subsequent
“crossover” of constant and alternate mean ratings between A2’s
first and second contact with the patient. It will be noted that, at the
15th day of treatment, alternate ratings of Group I patients had
dropped below constant ratings, though constant and new alternate
ratings of Group II patients (Figure 1) after 15 days of treatment
were almost identical. There was a suggestion in the data that alternate
raters tended to see more change in patients between their
first and second contacts than did constant raters during the same
period, partly because of higher first contact ratings and partly because
of lower second contact ratings. Though differences in change
scores between the two groups of raters for such periods did not
reach statistical significance, they were large enough to be suggestive
(Group I, 5th to 15th day, r = .57, t = 1.51; Group II, 15th
to 30th day, r = .59, t = 1.28; t required for significance at the .05
level, one-tailed test, = 1.83).

It is conceivable that new raters tended to rate high (in terms of
their habitual rating behavior), possibly because their ratings did


180 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


not reflect response to contrast between the way the patient was 
before treatment and his behavior after drug therapy. (Or perhaps 
the new raters were looking for pathology and tended to “accentuate 
the negative.”) If constant raters were tending to plateau at the 
15th day and new alternates’ ratings were elevated by one of the 
above considerations, morbidity scores for the two sets of raters 
would tend to converge. If the same forces were operating at the 30th 
day evaluation of treatment, then the extent to which the constant 
rater was “plateauing” would tend to be masked. 

The above assumed opposing forces notwithstanding, alternate 
ratings of patients were significantly lower than constant ratings af- 
ter 30 days of treatment. With reference to convergence of constant 
and alternate scores, it will be noted that the two sets of ratings 
“crossed” at the third contact of the regular rater, and that alternate
ratings continued downward (Figure 1). It is interesting to speculate 
as to what the relationship of the means of the two groups of raters 
would be at the fifth, or sixth, or tenth patient contact. 

Experimental psychologists have long been concerned with iso- 
lating the separate effects of maturation and learning (or practice) 
on the performance of a variety of tasks. In the frequently used
pretest-treatment-posttest research design, previous experience of a
cognitive task has been found to influence subsequent performance 
in unexpected and seemingly anomalous ways. Solomon (1949) re- 
ported that a pretest of spelling had a depressing effect on posttreat- 
ment scores of school children (treatment consisting of training in 
spelling). This phenomenon was attributed to perseveration of er- 
rors made on the pretest by subjects. Lana and King (1960) found 
that a pretest which consisted of recall of a story acted to sensitize 
college students so as to produce a differential posttest response from 
individuals not exposed to the pretest. The authors concluded that
learning which took place in the pretest probably accounted for
differential posttest performance in the pretested group of subjects.

It seems reasonable to suppose that judgments of patients made 

by clinicians may not only reflect “practice” of the individual rater 
in the use of a given instrument, but also “practice” with respect to a
particular patient. The literature of measurement contains a number
of references to a “plateau” effect observed in repeated evaluations 
by the same rater of individual patients (Meehl, 1959; Sines, 1959). 
Stanley (1961) suggested that sequence might be considered а fo 




effect in analyzing scores obtained by having each rater rate each 
ratee-trait combination more than once under randomized conditions 
that minimized memory carry-over. 

Since search of the literature revealed only one study focused 
specifically on the problem of rater stereotypy, that of Dailey in 
1952, the present study can be thought of perhaps as a pilot study 
in an area that is almost entirely unknown territory. Our results 
must be regarded as suggestive rather than conclusive, since a certain
amount of noise was intrinsic and inevitable in the research
design. The study would have been logistically impossible or pro- 
hibitively expensive to carry out had not our subjects been simul- 
taneously subjects in a comparative multiple phenothiazine study, 
who had to be seen and evaluated by one rater on four stipulated oc- 
casions. That rater could serve as “constant” rater, and, by contriv- 
ing to include one additional rater at each interview, the hypothesis 
of the present study could be tested. For optimum control of extraneous
effects, treatment should have been the same for all patients,
and the interval between evaluations kept constant. The
latter might well have been a distorting factor in results, both with 
respect to rater memory and to amount of change in patients. How- 
ever, as previously noted, the “plateau” phenomenon appeared to 
be independent of type of treatment, measuring instrument, or time 
between evaluations. 

The three raters, X, Y, and Z, were highly experienced clinicians
who were sophisticated in the use of rating instruments, having rated 
literally hundreds of patients in connection with studies of drug 
effects. Raters so trained in evaluating patients were necessary in 
order to carry out the present study. However, these particular 
clinicians were also full-time research psychologists, whose curiosity 
about the “plateau” effect lured them into a study of their own be- 
havior in a comparative drug study already in progress. Thus, raters 
were not blind to the fact that they were participating in an experiment
involving rater behavior. The effect on these clinicians of
knowledge of the research design cannot be estimated, but it was 
undoubtedly present. They adhered rigorously to the rule that pa- 
tients were not to be discussed at all; they did not look at the data 
until all 90 patients had completed processing in order not to be reminded
of the hypothesis of the study or become involved in speculations
about results. Only the principal investigator (Rater Y) kept a




roster of patients and raters, detailing the other two raters without 
mention of patient group or rater role. Rigid secrecy was not main- 
tained, but discussion was kept at a minimum. Whatever effect 
knowledge of the experimental design had on raters would have to 
have been at an unconscious level, but probably was sufficiently 
strong to affect results. To borrow a phrase from psychoanalysis, 
“reaction formation” on the part of raters may have introduced noise 
into the study which lowered the probability of finding significant 
results. 

Previous experiments have demonstrated that clinicians fall into 
stereotypes early in their contact with patients and that these pre- 
mature conclusions resist change. Findings of the present research 
suggest that treatment effects measured by repeated rater evalua- 
tions over a series of interviews may be clouded by premature freez- 
ing of clinical opinion. There are a number of further questions to 
be asked regarding the role of rater perseveration in the measured 
outcome of treatment: (1) What is the role of the criterion instru- 
ment in the “plateau” effect? Is it possible that an initial sharp drop 
in the total morbidity score computed from the MSRPP, followed 
by slight change in subsequent ratings, may be due to lack of sensi- 
tivity of the instrument to changes near the “normal” baseline? (2) 
What individual differences are there among clinicians with respect 
to their ability to modify opinions of patients? Is the number of 
contacts with a patient necessary to “freeze” opinion different for each 
clinician? (3) Do some patient characteristics lend themselves more 
readily than others to perseveration on the part of clinicians? (4) 
Are some patients more likely than others to be stereotyped by 
raters? (5) What is the relationship between the pattern of im- 
provement under treatment and the learning curve? Various growth 
or biological change curves? Is the “plateau” final or only tempo- 
rary, as sometimes found in practice effects on performance of com- 
plex tasks? 

The present research was not designed to answer all of the above 
questions, though subsequent data analyses are expected to throw 
some light on differences between rating instruments and among 
clinicians. Perhaps a study could be designed along the lines sug- 
gested by Stanley (1961), which would isolate the variance in clini- 
cians’ ratings due to previous contact with the patient and permit 
examination of a number of first and higher order interactions, including
rater-ratee contacts (or trials) with patients, trials-instruments,
trials-clinicians, trials-patients-clinicians, and so on. Since
repeated ratings are so widely used, not only in psychiatric research 
but in industry and various other areas of applied psychology, it 
seems important to determine the extent to which artifact (whether
conceptualized as pretest sensitization, practice effect, or perseveration
in error), arising out of rater-ratee contact, may operate to distort
measurement.


Summary 


Ninety newly-admitted acutely ill psychiatric patients referred for 
treatment with phenothiazine compounds were randomly assigned 
to three groups of 30 patients each. Each subject was evaluated pre- 
treatment and at the 5th, 15th, and 30th days of medication by two 
clinicians, using the MSRPP interview total morbidity scale. Group 
I was evaluated by a constant (or continuing) rater and simultane- 
ously by a second clinician who was replaced by a new rater at the 
5th day; Groups II and III by the constant rater and a second rater 
replaced at the 15th and 30th days, respectively. Three clinicians 
were responsible for rating all patients; individual raters were 
equally represented in both regular ratings and alternate ratings 
for each group. Ratings by old and new raters were compared at 
each evaluation point. The hypothesis, that the “plateau” effect 
noted in repeated evaluations of patients receiving treatment is
partly a function of rater perseveration, was supported by the finding
that the mean of new raters was significantly lower than the
mean of constant raters at the 30th day evaluation.


REFERENCES 


Dailey, C. A. “The Effect of Premature Conclusion Upon the Ac- 
quisition of Understanding a Person.” Journal of Psychology, 
XXXIII (1952), 133-152. 

Frank, J. D., Gliedman, L. H., Imber, S. D., Stone, A. R., and Nash,
E. H. “Patients’ Expectancies and Relearning as Factors Determining
Improvement in Psychotherapy.” American Journal of
Psychiatry, CXV (1959), 961-968.

Kurland, A. A., Hanlon, T. E., and Tatom, Mary H. “A Comparative
Study of Six Phenothiazine Compounds.” In Transactions of
the Fifth Research Conference on Cooperative Studies in Psychiatry
and Research Approaches to Mental Illness. Washington:
Veterans Administration Department of Medicine and Surgery,
1960.

Kurland, A. A., Hanlon, T. E., Tatom, Mary H., Ota, Kay Y., and
Simopoulos, A. M. “The Comparative Effectiveness of Six Phenothiazine
Compounds, Phenobarbital and Inert Placebo in the
Treatment of Acutely Ill Patients: Global Measures of Severity
of Illness.” Journal of Nervous and Mental Disease, CXXXIII
(1961), 1-18.

Lana, R. E. and King, D. J. “Learning Factors as Determiners of
Pretest Sensitization.” Journal of Applied Psychology, XLIV
(1960), 189-191.

Lorr, M. Multidimensional Scale for Rating Psychiatric Patients I:
Hospital Form. (VA Technical Bulletin No. 10-507) Washington:
Veterans Administration, 1953.

Lorr, M. and McNair, D. “Inpatient Multidimensional Psychiatric
Scale.” In Transactions of the Fifth Research Conference on
Cooperative Studies in Psychiatry and Research Approaches to
Mental Illness. Washington: Veterans Administration Department
of Medicine and Surgery, 1960.

Meehl, P. E. “Some Ruminations on the Validation of Clinical Procedures.”
Canadian Journal of Psychology, XIII (1959), 102-128.

Sines, L. K. “The Relative Contribution of Four Kinds of Data to
Accuracy in Personality Assessment.” Journal of Consulting
Psychology, XXIII (1959), 483-492.

Solomon, R. L. “An Extension of Control Group Design.” Psychological
Bulletin, XLVI (1949), 137-150.

Stanley, J. C. “Analysis of Unreplicated Three-way Classifications,
with Applications to Rater Bias and Trait Independence.” Psy- 
chometrika, XXVI (1961), 205-219. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


ELECTRONIC COMPUTER PROGRAMS AND 
ACCOUNTING MACHINE PROCEDURES 


Edited by 
WILLIAM B. MICHAEL 


University of California, Santa Barbara


Generalized Item and Test Analysis Program—A Program for 

the Control Data 1604 Computer. FRANK B. BAKER
One-Way Analysis of Covariance Fortran Program for the
IBM 7090. ANITA ZOE LEWIS
FAST Processing of Psychological Tests Using High Speed
Computing Machines. RUSSELL N. CASSEL AND RENE DE LA BRIANDAIS

In view of the tremendous advances that have been made in the
adaptation of electronic computers and accounting machines to the 
processing of statistical data, sections of the Spring and Autumn is- 
sues of EDUCATIONAL AND PSYCHOLOGICAL MEASURE- 
MENT are devoted to the publication of such programs as are 
appropriate to psychometric procedures. Programs relevant to such 
problem areas as factor analysis, item analysis, multiple regression 
procedures, the estimation of the reliability and validity of tests, 
pattern and profile analysis, the analysis of variance and co- 
variance, discriminant analysis, and test scoring will be consid- 
ered. Customarily a program should be expected not to exceed 
six or eight printed pages. Manuscripts of four or fewer printed 
pages are preferred. Each manuscript will be carefully reviewed as
to its suitability and accuracy of content. In some instances an ac- 
cepted paper may be returned to the author for possible revisions 
or shortening. The cost to the author will be fifteen dollars per page 
for regular running text. The extra cost of the composition of tables 
and formulas will be added to the basic rate. Manuscripts received 
up to November first will be considered for the Spring issue; manuscripts
received between then and May first will be considered for
the Autumn issue. 


All correspondence should be directed to
William B. Michael 
Professor of Education and Psychology 
University of California, Santa Barbara 
University, California 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


GENERALIZED ITEM AND TEST ANALYSIS PROGRAM— 
A PROGRAM FOR THE CONTROL DATA 1604 COMPUTER 


FRANK B. BAKER 
University of Wisconsin 


The development of educational and psychological instruments
is a lengthy iterative procedure in which the test constructor must 
manipulate many variables in order to achieve the desired outcomes. 
The developer must experiment with item content, item order, the 
scoring scheme, various groups of subjects, and various criterion 
measures. Determination of the effectiveness of these manipulations
can to some degree be obtained from the changing values of test and
item statistics yielded by the instrument. Also, users of commercially 
developed tests need to be able to analyze data obtained from these 
instruments in actual use. The Generalized Item and Test Analysis 
Program (GITAP) is designed to provide the test developer or con- 
sumer with a means of easily analyzing the instrument as a whole 
as well as each of the items constituting the instrument. The flexible
modes of analysis and the capacity of the program in terms of 
numbers of items and subjects are such that nearly all existing com- 
mercial instruments can be item-analyzed. 


Analyses Performed 


1. A test score for each individual. 

2. A grouped frequency distribution of the test scores. 

3. Internal consistency reliability of the test by means of Hoyt's 
Analysis of Variance Method (Hoyt, 1941). 

4. Summary statistics of the sample: $N$, $\bar{X}$, $s$, $\sum X$, $\sum X^2$.

5. Statistics computed for each of the response choices of the
item.




A. Traditional Indices 
Item difficulty. 

Item-criterion correlation. Either biserial r or point biserial
r is available for use.

B. Parameters of the item characteristic curve (Finney, 1944)
X50, the point on the criterion scale corresponding to the
median of the item characteristic curve.

β, the reciprocal of the standard deviation of the item characteristic
curve.
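
The reliability estimate in analysis 3 can be illustrated with a short modern sketch. It is not GITAP code (the program was written in 1604 assembly language); it only assumes a complete subjects-by-items matrix of item scores and applies the usual analysis of variance decomposition, with reliability taken as one minus the ratio of the residual mean square to the between-subjects mean square. The function name `hoyt_reliability` is hypothetical.

```python
def hoyt_reliability(scores):
    """Analysis-of-variance reliability for a subjects x items score matrix.

    scores: list of rows, one per subject, each a list of item scores.
    Assumes at least two subjects, two items, and some variation among subjects.
    """
    n = len(scores)        # number of subjects
    k = len(scores[0])     # number of items
    grand_total = sum(sum(row) for row in scores)
    correction = grand_total ** 2 / (n * k)

    ss_total = sum(x * x for row in scores for x in row) - correction
    ss_subjects = sum(sum(row) ** 2 for row in scores) / k - correction
    item_totals = [sum(row[j] for row in scores) for j in range(k)]
    ss_items = sum(t * t for t in item_totals) / n - correction
    ss_residual = ss_total - ss_subjects - ss_items

    ms_subjects = ss_subjects / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return 1.0 - ms_residual / ms_subjects


if __name__ == "__main__":
    data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    print(round(hoyt_reliability(data), 4))
```

For dichotomously scored items this estimate agrees numerically with Kuder-Richardson formula 20, which offers a convenient check on hand calculations.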


Several options are available which greatly increase the flexibility 
of the program. 


1. It is possible to include only selected items in the computation 
of a test score; hence to perform item-analysis using a subscale
score as the criterion. 

2. An external score can be used as the criterion measure in the
item analysis. When an external criterion is employed, analyses
1-3 of the preceding section are omitted.

3. Successive scoring keys can be applied to the data within a
single computer run. 
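
The first two options can be sketched as follows. This is an illustration only, not the program's own routine: the data layout (0-based choice indices, one weight list per item) and the names `score_subjects` and `point_biserial` are assumptions made for the example. The criterion passed to `point_biserial` may be a total score, a subscale score built from selected items, or an external score.

```python
import math


def score_subjects(responses, weights, selected=None):
    """Weighted test score for each subject.

    responses[s][i] is the (0-based) choice made by subject s on item i;
    weights[i][c] is the integer weight of choice c on item i;
    selected optionally lists the item indices entering the (sub)scale.
    """
    items = range(len(weights)) if selected is None else selected
    return [sum(weights[i][resp[i]] for i in items) for resp in responses]


def point_biserial(chose, criterion):
    """Point biserial r between a 0/1 choice indicator and a criterion score."""
    n = len(criterion)
    n_chose = sum(chose)
    if n_chose == 0 or n_chose == n:
        return None                      # choice made by nobody or by everybody
    mean_all = sum(criterion) / n
    sd_all = math.sqrt(sum((c - mean_all) ** 2 for c in criterion) / n)
    if sd_all == 0:
        return None                      # criterion has no variance
    mean_chose = sum(c for f, c in zip(chose, criterion) if f) / n_chose
    p = n_chose / n
    return (mean_chose - mean_all) / sd_all * math.sqrt(p / (1 - p))
```

Applying a second scoring key within the same run then amounts to calling `score_subjects` again with a different `weights` table.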


Capabilities of and Restrictions to the Program 


1. The total number of possible item choices in the test must be less
than 1800. For example a 300-item five-choice test has 1500 
choices. 

2. Successive items need not have the same number of response 
choices although no item may have more than seven possible re- 
sponses. 

3. The scoring scheme consists of integer weights for each item response
choice where: $0 \le w \le 7$.

4. The maximum number of subjects is 32,767. 

5. The program was written in machine assembly language (AR-2) 

and is completely self-contained. 
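
A user preparing a run might check a proposed test against restrictions 1 through 4 before punching cards. The sketch below is a hypothetical helper, not part of GITAP; the function name and argument layout are assumptions.

```python
def check_gitap_limits(choices_per_item, weights, n_subjects):
    """Return a list of violated restrictions (empty if the job fits).

    choices_per_item: number of response choices for each item;
    weights: one list of integer weights per item;
    n_subjects: number of subjects in the sample.
    """
    problems = []
    if sum(choices_per_item) >= 1800:
        problems.append("total item choices must be less than 1800")
    if any(c > 7 for c in choices_per_item):
        problems.append("no item may have more than seven choices")
    if any(not (isinstance(w, int) and 0 <= w <= 7)
           for item in weights for w in item):
        problems.append("weights must be integers from 0 through 7")
    if n_subjects > 32767:
        problems.append("at most 32,767 subjects are allowed")
    return problems
```

The 300-item five-choice test mentioned in restriction 1 passes, since it has 1500 choices; a 300-item six-choice test would not.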


Input-Output 


The basic input medium is 80-column punched cards using a
standardized format (Control Data Corporation CO-OP Manual, 
1961). The response choices made by the subjects to each item, the 




scoring scheme for each item, and parameters describing the sample 
and the test are punched into cards and read off-line to magnetic 
tape via the 160 satellite computer. The results of the analyses are 
printed off-line by the high speed printer via the 160 satellite com- 
puter. Hence, the input-output medium of the actual program is 


magnetic tape. 


Running Times


The running times for GITAP have been checked by analyzing 
various test and sample sizes and using the internal real time clock 
to obtain an accurate measurement of the time used. Table 1 sum- 
marizes the analyses performed and the time required. 


TABLE 1 


Running Times for GITAP

Items    Response Choices per Item    Sample Size    Running Time


Summary 


The Generalized Item and Test Analysis Program was developed 
from sections of an earlier Univac 1103 program (Baker, 1959), tak- 
ing into account the increased memory and superior capabilities of 
the 1604 computer. The only limitation of any serious consequence 
is the requirement that facsimiles of the test papers be punched into 
cards. Optical document readers exist which can directly record the
item response choices onto magnetic tape. Compatibility of data
tapes prepared in this manner has essentially been provided for
within the present program.

The large capacity of the program, the optional modes of analyses, 
and the rapid processing time should provide the measurement specialist
as well as the practitioner many new avenues of research.

* Persons interested in the detailed user's manual for the program should contact
the Numerical Analysis Laboratory, University of Wisconsin.




Figure 1. Flow Chart 
GENERALIZED ITEM AND TEST ANALYSIS PROGRAM 


REFERENCES 


Baker, F. B. “Univac Scientific Computer Program for Test Scoring
and Item Analysis.” Behavioral Science, CPA1, IV (1959),
Control Data Corporation CO-OP Manual, Publications Number
67B, July, 1961.
Finney, D. J. “The Application of Probit Analysis to the Results of
Mental Tests.” Psychometrika, IX (1944), 31-39.
Hoyt, C. J. “Test Reliability Estimated by Analysis of Variance.”
Psychometrika, VI (1941), 153-60.


EDUCATIONAL AND PsYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


ONE-WAY ANALYSIS OF COVARIANCE FORTRAN 
PROGRAM FOR THE IBM 7090


ANITA ZOE LEWIS 


Biometric Laboratory 
The George Washington University 


I. General Description 


A. This program performs an analysis of covariance with a
single variable of classification and unequal sample sizes. 

B. In addition, each subject's adjusted post-treatment score
may be computed, expressed as his deviation from the regression
line. This is optional.

C. The maximum number of treatment groups is 10. As many 
as 100 subjects are allowed in one treatment group. The
upper limitation on the total sample size of all treatment 
groups is 1000. 


II. Card Preparation


A. Problem card 
cols. 1,2 problem number (not = 99) 
3,4 number of treatment groups 
5,6 number of scores on each data card 
7 = 1 if each subject's adjusted post-treatment 
score is to be computed. 
= 2 if each subject's adjusted post-treatment 
score is not to be computed. 
B. Group name cards 
cols. 1-20 name of first treatment group 
21-40 name of second treatment group 
41-60 name of third treatment group 




Continue on next card until all groups are represented. 
C. Input data cards 
cols. 1,2 problem number (must match problem card) 
3-8 Subject identification, used to check correspondence
between X and Y cards
9 Period, used to check the order of the cards.
X period must be < Y period
10-12 Identification, sent to output with adjusted post-treatment
score option
13-72 Scores for variables 1 through v, where v ≤ 10.
Each score occupies 6 columns and may consist of
either an integer or a number punched with a decimal
point.
999999 signifies a missing observation.


Data cards come in pairs—X (pre-treatment) values in the 
first card, Y's (post-treatment) in the second. The first X and Y 
values on each card will be treated together independent of other 
X or Y values on the card (if there are any). The second set of values 
will be treated next, etc., with a maximum of 10 sets (number of 
variables). 

One treatment group follows another. A card with 9's in cols. 
1-10 signifies the end of a treatment group. 

D. More than one problem can be processed consecutively. 
Simply place one deck (card types A through C) after another. A 
card with 9's in cols. 1-10 signifies there are no more problems in
the deck. This means that two cards with 9's in cols. 1-10 must be 
put after the last group of the last problem. 
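
The data card layout of section C can be made concrete with a small reader. This is a sketch in Python rather than the program's Fortran, and the names `parse_data_card`, `is_end_card` and `check_pair` are hypothetical. It assumes that every score field is punched (a blank field is not handled) and that the number of scores per card has already been read from cols. 5,6 of the problem card.

```python
MISSING_CODE = "999999"   # punched value that signifies a missing observation


def parse_data_card(card, n_scores):
    """Split one 80-column data card into its fields (columns are 1-based)."""
    if not 1 <= n_scores <= 10:
        raise ValueError("a card holds between 1 and 10 scores")
    card = card.ljust(80)

    def cols(first, last):
        return card[first - 1:last]

    scores = []
    for k in range(n_scores):
        raw = cols(13 + 6 * k, 18 + 6 * k).strip()
        scores.append(None if raw == MISSING_CODE else float(raw))
    return {
        "problem": int(cols(1, 2)),
        "subject": cols(3, 8).strip(),
        "period": cols(9, 9),
        "ident": cols(10, 12).strip(),
        "scores": scores,
    }


def is_end_card(card):
    """A card with 9's in cols. 1-10 ends a treatment group (or the deck)."""
    return card[:10] == "9" * 10


def check_pair(x_card, y_card):
    """X and Y cards must agree on problem and subject, with X period < Y period."""
    return (x_card["problem"] == y_card["problem"]
            and x_card["subject"] == y_card["subject"]
            and x_card["period"] < y_card["period"])
```

Missing observations are returned as `None` so that a later step can decide whether to drop the X, Y pair for that variable.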


III. Output 


А. Output cards (optional) 
cols. 1-8 Same as input data cards 
9 Blank 
10-12 Same as input data cards 
13-72 Deviation from regression ($y - \hat{y}$) for variables
1 through v, where v ≤ 10 (6-digit field with one digit
after the decimal point; 9999.9 signifies missing
data).
B. Printed S: 




Identification page: 

GEORGE WASHINGTON UNIVERSITY 

BIOMETRIC LABORATORY 

Standard page: 1 variable/page 

ANALYSIS OF COVARIANCE 

PROBLEM xx, VARIABLE vv 

TREATMENT   DF   SUM X SQ   SUM XY   SUM Y SQ   (2 dec.)
B   (4 dec.)
DF   SUM D SQ   MEAN SQ   F   (2 dec.)
Table as in Snedecor¹ (p. 401).

TREATMENT   MEAN X   MEAN Y   ADJ. MEAN   (2 dec.)


Note: If there is only one treatment group, nothing below line
one of the table found in Snedecor (p. 401) will be printed.


IV. Mathematical Development²


A. Analysis of Covariance 
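$\sum X_g$ = Sum of X scores for one group.
$N_g$ = Number of scores in a group, etc.
$\bar{X}_g = \sum X_g / N_g$, $\bar{Y}_g = \sum Y_g / N_g$
$\sum\sum X_g$ = Sum of X scores for all groups, etc.
$X_i$ = ith score of a group, etc.


For group rows:

\[ df_g = N_g - 1 \]
\[ \sum x^2 = \sum (X_i - \bar{X}_g)^2 = \sum X_i^2 - 2\bar{X}_g \sum X_i + N_g \bar{X}_g^2 = \sum X_i^2 - \Big(\sum X_i\Big)^2 / N_g \]
\[ \sum y^2 = \sum (Y_i - \bar{Y}_g)^2 = \sum Y_i^2 - \Big(\sum Y_i\Big)^2 / N_g \]
\[ \sum xy = \sum (X_i - \bar{X}_g)(Y_i - \bar{Y}_g) = \sum X_i Y_i - \bar{X}_g \sum Y_i - \bar{Y}_g \sum X_i + N_g \bar{X}_g \bar{Y}_g = \sum X_i Y_i - \Big(\sum X_i\Big)\Big(\sum Y_i\Big) / N_g \]

We want to compute b such that $\sum (y_i - b x_i)^2$ is minimum:

\[ \frac{d}{db}\Big(\sum y^2 - 2b \sum xy + b^2 \sum x^2\Big) = -2 \sum xy + 2b \sum x^2 = 0 \]
\[ b_g = \sum xy / \sum x^2 \]
\[ df_{d_g} = df_g - 1 \]
\[ \sum d_g^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum \big[ Y_i - \bar{Y}_g - b_g (X_i - \bar{X}_g) \big]^2 = \sum (y_i - b_g x_i)^2 \]
\[ \phantom{\sum d_g^2} = \sum y^2 - 2 b_g \sum xy + b_g^2 \sum x^2 = \sum y^2 - \Big(\sum xy\Big)^2 / \sum x^2 \]


For “within” row:

\[ df_w = \sum_g df_{d_g} \]
\[ \sum d_w^2 = \sum_g \Big(\sum d_g^2\Big) \]
\[ s_w^2 = \sum d_w^2 / df_w \]


For “common” row:

\[ df_c = \sum_g df_g \]
\[ \sum x_c^2 = \sum_g \Big(\sum x^2\Big), \quad \sum xy_c = \sum_g \Big(\sum xy\Big), \quad \sum y_c^2 = \sum_g \Big(\sum y^2\Big) \]
\[ b_c = \sum xy_c / \sum x_c^2 \]
\[ df_{d_c} = df_c - 1 \]
\[ \sum d_c^2 = \sum y_c^2 - \Big(\sum xy_c\Big)^2 / \sum x_c^2 \]
\[ s_c^2 = \sum d_c^2 / df_{d_c} \]


For “regression coefficient” row:

\[ df_r = df_{d_c} - df_w \]
\[ \sum d_r^2 = \sum d_c^2 - \sum d_w^2 \]
\[ s_r^2 = \sum d_r^2 / df_r \]
\[ F_r = s_r^2 / s_w^2 \]


For “total” row (sums taken over all $N = \sum_g N_g$ subjects):

\[ df_t = N - 1 \]
\[ \sum x_t^2 = \sum\sum X_i^2 - \Big(\sum\sum X_i\Big)^2 / N, \quad \sum y_t^2 = \sum\sum Y_i^2 - \Big(\sum\sum Y_i\Big)^2 / N \]
\[ \sum xy_t = \sum\sum X_i Y_i - \Big(\sum\sum X_i\Big)\Big(\sum\sum Y_i\Big) / N \]
\[ df_{d_t} = df_t - 1 \]
\[ \sum d_t^2 = \sum y_t^2 - \Big(\sum xy_t\Big)^2 / \sum x_t^2 \]


For “adjusted mean” row:

\[ df_a = df_{d_t} - df_{d_c} \]
\[ \sum d_a^2 = \sum d_t^2 - \sum d_c^2 \]
\[ s_a^2 = \sum d_a^2 / df_a \]


For end of table:

\[ \text{Mean } X_g = \sum X_g / N_g \]
\[ \text{Mean } Y_g = \sum Y_g / N_g \]
\[ \text{Adjusted mean}_g = \Big(\sum Y_g / N_g\Big) - b_c \Big(\sum X_g / N_g - \bar{X}\Big) \]

where $\bar{X}$ is the mean of X over all groups.

¹ Snedecor, George W., Statistical Methods (Fifth Edition). Ames, Iowa: Iowa
State College Press, 1956.
² See footnote 1, pp. 122-126, 394-403.
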
B. Linear Regression 
\[ Y_i - \hat{Y}_i = Y_i - \bar{Y}_g - b_g (X_i - \bar{X}_g) \]
V. General Outline 


For each problem:
I. For all variables of one group (Repeat for each group)
A. Store all subject identifications, X's and Y's.
B. Take $\sum X$, $\sum Y$, $\sum XY$, $\sum X^2$, $\sum Y^2$ for each variable.
C. Sum the groups for “total” computations later.
D. Compute $\sum x^2$, $\sum y^2$, $\sum xy$, b for each variable.
E. Put X's, Y's and subject identifications on tape for
computations later. (optional)
F. Save $\sum x^2$, $\sum y^2$, $\sum xy$, $\sum X$, $\sum Y$ for each group for
each variable.
G. Save N for each group.
II. For one variable (repeat for each variable)
A. Sum $\sum x^2$, $\sum y^2$, $\sum xy$, $\sum N$ to be used in “common”
computations later.
B. Send out treatment name, df, $\sum x^2$, $\sum xy$, $\sum y^2$, b, df,
$\sum d^2$, mean square for each group.
C. Sum $\sum d^2$ to be used in “within” computations later.
D. Send out rows: within, reg. coef., common, adj. means,
total.
E. Send out mean x, mean y, and adjusted mean for each
group.
F. Compute deviation from regression for each subject for
each variable and punch out with identification. (optional)
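
The outline above can be followed in a few lines of modern code. The sketch below is written in Python rather than the program's Fortran and handles a single variable with no missing observations. It assumes at least two groups, each with three or more subjects and some variation in X. The function names and the layout of the returned dictionary are hypothetical.

```python
def corrected_sums(xs, ys):
    """Return N and the corrected sums of squares and products for one set of pairs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    return n, sxx, sxy, syy


def ancova(groups):
    """One-way analysis of covariance; groups maps a name to (X list, Y list)."""
    all_x, all_y, means = [], [], {}
    df_c = sxx_c = sxy_c = syy_c = 0.0      # "common" row accumulators
    df_w = ssd_w = 0.0                      # "within" row accumulators
    for name, (xs, ys) in groups.items():
        n, sxx, sxy, syy = corrected_sums(xs, ys)
        df_c += n - 1
        sxx_c += sxx
        sxy_c += sxy
        syy_c += syy
        df_w += n - 2                       # group deviations df
        ssd_w += syy - sxy ** 2 / sxx       # group sum of squared deviations
        means[name] = (sum(xs) / n, sum(ys) / n)
        all_x.extend(xs)
        all_y.extend(ys)

    b_c = sxy_c / sxx_c
    df_dc = df_c - 1
    ssd_c = syy_c - sxy_c ** 2 / sxx_c

    n_t, sxx_t, sxy_t, syy_t = corrected_sums(all_x, all_y)   # "total" row
    df_dt = n_t - 2
    ssd_t = syy_t - sxy_t ** 2 / sxx_t

    df_r = df_dc - df_w                     # "regression coefficient" row
    f_r = ((ssd_c - ssd_w) / df_r) / (ssd_w / df_w)
    df_a = df_dt - df_dc                    # "adjusted mean" row
    ms_a = (ssd_t - ssd_c) / df_a

    grand_x = sum(all_x) / n_t
    adjusted = {name: my - b_c * (mx - grand_x)
                for name, (mx, my) in means.items()}
    return {"b_common": b_c, "F_reg_coef": f_r, "df_adj": df_a,
            "ms_adj": ms_a, "adjusted_means": adjusted}
```

Accumulating the common and within rows in a single pass over the groups mirrors steps I and II of the outline; only the total row needs the pooled raw scores.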




VI. Machine Requirements 


A. Logical tape number

5 Input 
6 Output to be printed 
9 Output to be punched (optional) 
4 Intermediate storage 

B. Approximate timing: 

Essentially tape time. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 1, 1963


FAST PROCESSING OF PSYCHOLOGICAL TESTS USING 
HIGH SPEED COMPUTING MACHINES 


RUSSELL N. CASSEL 
Lompoc Unified Schools, California 
AND 
RENE DE LA BRIANDAIS 


Delta Data Corporation 
Richmond, California 


The FAST system for administering and scoring of psychological
tests, and for the preparation of such data, is the “fully automatic
scoring technique,” a procedure initially used by the California
State Department of Education Data Processing Pilot Project under 
Title X of the National Defense Education Act (NDEA) at Rich- 
mond, California, late in the year 1960 (Grossman & Crisler, 1962). 


Procedures 
Administration 


IBM Card Sets. Specially developed International Business Machine
(IBM) mark-sense cards or card sets designed for the respective
tests involved are substituted for the conventional answer sheets as
described by Wilkes (1961, 1962). Students record their answer
choices to the tests directly on such referenced cards through use of
the familiar electrographic pencil.

Instructions. The instructions developed from a given test standardization
and provided by the test publishers are followed precisely
as given except that any references to the answer sheets are
altered appropriately to refer to the answer cards (Durost & McQuitty,
1961).



Preprocessing of IBM Answer Cards 


Generally, card sets (one or more answer cards, and pupil identi- 
fication card where necessary) are prepared to accommodate a given 
test or test battery for each individual taking the test. Such sets can 
be readily integrated with existing pupil identification systems that 
already involve automated data processing techniques. Or a new sys- 
tem may be easily established through having each pupil accomplish 
specially prepared mark-sense identification cards, which are pre- 
punched with numbers to agree with the punched numbers on the 
answer cards of the particular pupil involved. 

Visual Screening. During the initial handling and arranging of
answer cards at the data processing center, observations are made to
determine the presence of such gross conditions as card mutilations,
improper markings, or incomplete cards. Such conditions, to be sure,
are necessarily corrected before further processing is attempted.

Preliminary Processing. During this stage the cards are processed 
by conventional tabulating equipment. The sorter is used to effect 
common facing of the cards, where necessary, and to gather the 
answer cards into sets for each student. The reproducing punch is 
used to make holes corresponding to the student recorded answer 
choices in order that the high speed computer can read the test data. 

In addition, the reproduction punch may be used to complete the 
pupil identification cards. 


Scoring and Reporting 


The computer stage of the processing is reserved solely for
high speed computing machines.

Computer Processing. The entire scoring of tests can be performed 
by the 1401 computer or by the 1401 and 7090 computers in combination.
When these two high speed computers are used in combination,
a magnetic tape prepared by the 1401 machine is used as
a medium of communication. The test company's standardization
procedures and normative findings are followed precisely in
the scoring and utilization of interpretative data.

Reporting. Any kind of report that is mechanically producible
with the available data can be generated. This includes all of the 
customary as well as certain other reports which have usually only
previously been available for research purposes at considerable expense.
Some of the more common reports include

(a) Alphabetical listings—any desired groupings; 

(b) Frequency distributions—any desired groupings; 

(c) Statistical data—means, standard deviations, and standard
errors; 

(d) Individual records—profile sheets and pressure sensitive la- 
bels; 

(e) Norm comparisons—national, local, individual norms; 

(f) Item counts—numbers of right responses, and information 
for item analysis and internal validity studies; 

(g) Statistical studies—such as those in correlation and regression 

analysis. 


Advantages 


Often an implication is made that the benefits of the computer or 
data processing equipment still are largely in the future; in reality 
Such services are immediately available, even to those school dis- 
tricts that may not have their own data processing equipment. 

Economy in Time and Cost. Through use of high speed computers,
the processing time for psychological test data of districts under 
10,000 pupils has been shortened to perhaps 20 or 30 minutes; even 
for districts with more than 10,000 pupils only 2 or 3 hours of proc- 


Answer card developed and used by the FAST system is usually 
easier for the pupil to utilize than is the conventional 8½ × 11
answer sheet, since he can place it closer to the test booklet and can
adapt it to the smaller writing space often found in the administration
of standardized tests. It is also highly durable. In spite of its
small and compact size, the FAST system answer card generally
affords more space for the recording of answers and thus makes it
easier for pupils to record their answers in the proper spaces. This
is accomplished largely by increasing the number of cards utilized
for a particular test or test battery. It could be expected that, for




pupils with low average and below average intelligence, modest 
gains in reliability and validity of the test data could be realized. 
With such pupils the task of marking answers often is more complex 
than the answering of the test item; that is, the pupil may know 
the proper answer to an item, but be unable to find the proper space 
in a single large answer sheet on which to record it. 

Reduction of Mark-Sense Error. The use of a specially prepared
abrasion powder applied on the IBM answer cards improves the
quality of electrographic pencil markings through adding more lead 
to be deposited; hence better electrical conduction is afforded. Also, 
the use of the powder makes the erasure and correction of items by 
the pupil easier to accomplish. Because of the larger marking space, 
the greater amount of lead marking also tends to increase electrical 
conductivity. 

Built-in Checks. Multiple electrical check circuits have been in- 
stalled in computer equipment at critical points in the processing 
mechanisms. They are specially designed to provide continuous 
monitorship of the accuracy of the processing function. Through 
such checks the fidelity of data processing has been significantly 
increased. 

Elimination of Incomplete Cards. The computer is programmed to 
inspect the information on the card for the purpose of determining 
the presence of inadequate or incomplete marking patterns, and for 
checking the accuracy of the punching. This may include such things 
as: double entries, failure to respond, incomplete sets of cards, improper
markings (ball point pens, waxy pencils, etc.), as well as
many other defects.
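
The kind of inspection described above can be sketched in a few lines of modern code. The sketch below is only an illustration and is not the FAST program itself: the record layout (`pupil`, `card_no`, `marks`) and the function name `inspect_cards` are hypothetical, and only three of the defects named above (double entries, failures to respond, incomplete sets of cards) are checked.

```python
from collections import Counter


def inspect_cards(cards, cards_per_set):
    """Return a list of (pupil, card_no, item, defect) tuples.

    `cards` is a list of dicts with hypothetical keys:
      pupil   - pupil identifier
      card_no - position of the card within the pupil's set
      marks   - one list of marked choices per item on the card
    """
    defects = []
    for card in cards:
        for item, marks in enumerate(card["marks"], start=1):
            if len(marks) == 0:
                # No mark was detected for this item.
                defects.append((card["pupil"], card["card_no"], item, "no response"))
            elif len(marks) > 1:
                # More than one choice was marked for this item.
                defects.append((card["pupil"], card["card_no"], item, "double entry"))

    # A pupil whose number of cards differs from the expected count
    # has an incomplete (or over-complete) set.
    cards_per_pupil = Counter(card["pupil"] for card in cards)
    for pupil, count in cards_per_pupil.items():
        if count != cards_per_set:
            defects.append((pupil, None, None, "incomplete set of cards"))
    return defects
```

A card set that passes produces an empty list; anything else can be routed back for hand inspection before scoring.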


REFERENCES 


Durost, W. N., and McQuitty, J. V. “An Experimental Evaluation 
of the IBM Separate Answer Sheet versus the IBM Separate 
Answer Card for Machine Scoring.” Test Service and Advise- 
ment Center, Dumbarton Center, New Hampshire, 1961. 

Grossman, A. and Crisler, D. “Data Processing Pilot Project: A 
Progress Report for 1960-61.” California Schools, May 19 А 
Vol. XXXIII, No. 5. Sacramento, California, Department of
Education.

Wilkes, C. F. “Cross-Validation Comparative Data on FAST Scor- 
ing Cards and IBM Answer Sheets." (Unpublished Report). 
Richmond Schools, Richmond, California, 1961.

Wilkes, C. F. “Fully Automatic Scoring Technique—FAST. A p 
mary Report." Richmond Schools, Richmond, California; 1967. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
VOL. XXIII, No. 1, 1963


BOOK REVIEWS 


Edited by 
WILLIAM B. MICHAEL 
University of California, Santa Barbara 


McNemar’s Psychological Statistics (Third Edition). JULIAN C. STANLEY
Winer’s Statistical Principles in Experimental Design. DAVID E. WILEY
Ray’s An Introduction to Experimental Design. GENE V. GLASS AND JULIAN C. STANLEY
Block’s The Q-Sort Method in Personality Assessment and Psychiatric Research. HAROLD BORKO
Arnspiger’s Personality in Social Process. ARTHUR J. BRODBECK
Witkin, Dyk, Faterson, Goodenough and Karp’s Psychological Differentiation. A. JEAN AYRES
Van Dalen and Meyer’s Understanding Educational Research. GENE V. GLASS AND JULIAN C. STANLEY
Colman and Smallwood’s Computer Language, An Autoinstructional Introduction to Fortran. ELLIOT M. CRAMER
Miller’s An Introduction to the Calculus of Finite Differences and Difference Equations. ELLIOT M. CRAMER
Mager’s Preparing Objectives for Programmed Instruction. DESMOND L. COOK
Deterline’s An Introduction to Programed Instruction. HANS GEORG STERN




Psychological Statistics (Third Edition) by Quinn McNemar. New
York: John Wiley & Sons, Inc., 1962. Pp. vii + 451. 

The first (1949) edition of this textbook designed for a year-long
course taken by quantitatively apt students and taught by a knowledgeable
instructor steered a judicious course between the Scylla of
cookbook recipes and the Charybdis of excessive proof and derivation.
The second (1955) edition followed the same pattern, with improvements.
The third edition seems less changed, though a number
of alterations increase its scope and clarity.

This edition is somewhat longer than its predecessor, 451 versus
408 pages with slightly more words per page, chiefly because of the
one new chapter, “Trends and Differences in Trends” (pp. 346-361),
and the exercises for Chapters 10-20, which are new. The title of
Chapter 14 has been changed from “Comparison of Variabilities” to
“Inferences about Variabilities.” Other titles, and the order of the
19 chapters from the second edition, are unchanged. The same seven
tables appear in the Appendix, in more legible type than before.

Besides the chapter on trends and the new exercises, the Preface
informs us that “Other major additions include the following: the
development of the error formula for the difference between independent
proportions; the empirical work on the effect of assumption
violations on the t and F tests; significance tests for regression
coefficients; a greatly extended discussion of the effects of errors of
measurement ...; a brief presentation of an old topic, the correlation
of sums; a simple, though ancient, derivation of the standard error
of the mean formula; a proof that s² is unbiased; a derivation showing
the connection between variance and χ², and the deduction of
the χ², t, and z (critical ratio) tests from the F test; an algebraic
determination of the expected values of variance estimates in one-way
analysis of variance; more on reliability by way of analysis of
variance; the connection between the concepts of interaction and
correlation; a further explication of Latin square usage in psychological
research; and additional nonparametric techniques, along with
a discussion which indicates that I disagree with one of the reasons
advocated for the usage of nonparametric methods” (pp. v-vi).
McNemar also, on pages 285-287 and 345 under the headings “selected
contrasts” and “selected comparison,” discusses multiple comparisons
of means, using Scheffé’s procedure.




As reviewers always do, this one wishes that certain other topics
had been covered, too. For example, the finite-coefficient model of
Cornfield-Tukey and others might have been introduced, perhaps
thereby making the “Analysis of Variance: Complex” chapter a bit
less complex by eliminating the necessity for so many “cases.” Also,
one would like to see more emphasis on confidence intervals than is
present.

Repeated-measurement designs are used illustratively a great deal
in the book, but they seem not to be distinguished clearly from other
factorial and randomized block designs.¹ No reference to the work
of Geisser and Greenhouse, Lubin, Box, and others on problems
caused by repeated measurement appears, but this is not surprising
when one considers the sparse citation in the entire book. (Sir Ronald
Fisher has five page references in the index, to two books, but Tukey,
Kempthorne, and Wilk (and McNemar) have none.) As in previous
editions, the instructor using this text will have to supply his own
references for nearly all supplementary sources. The author covers
such topics as errors of measurement without citing the work of
Gulliksen, Lord, or Hoyt. Usually his discussions are rather self-contained,
but the bridge to further learning has not been provided.

Students will be exposed to a great deal more symbolic thinking
than in most statistics texts intended for psychology and education,
but they will not learn from this book the kind of vocabulary (plot,
block, variance component, crossover design, etc.) that will smooth
the transition to the work of Cochran and Cox, Biometrics, and other
experimental-design sources. They will also learn more psychometric
statistics than in any other statistics book of which this reviewer is
aware, for McNemar follows in the tradition of his mentor Kelley.
He has succeeded far better than others in effecting a rapprochement
between associational analysis and experimentation, though they
are still in the primitive “parallel play” phase of development.

In summary, if one has been pleased with the second edition of
Psychological Statistics as the main textbook for a statistics course,
one will probably welcome this third edition, which very likely
means that one has bright students who can face its relatively symbolic
but mathematically elementary approach with determination
for a year. Very likely the typical student will not sell his copy when


¹ McNemar perceives this, for on page 338 of the second edition and 340 of
the current one he states, following the “higher-order classification” discussion
and under the heading “Factorial and Latin Square Designs”: “The student
who encounters the term ‘factorial design’ will need to know that it is difficult
to make a distinction between factorial design and the analysis of variance
setups discussed in this chapter. The bases for classification are referred to as
factors; the categories within a classification are termed ‘levels.’ Perhaps the
term factorial design is inappropriate when one basis for classification is persons.”




the course is over, preferring to keep it for frequent use. The cover- 
age is unusually extensive, the treatment systematic, and the expo- 
sition lucid. 

JULIAN C. STANLEY 

Department of Educational Psychology 

University of Wisconsin 


Statistical Principles in Experimental Design by B. J. Winer. New 
York: McGraw-Hill Book Company, Inc., 1962. Pp. x + 672.
$12.50.

This book by the Editor of Psychometrika, who is a professor of
psychology and statistics at Purdue University, admirably fulfills 
the promise of its title. It elucidates the statistical principles of ex- 
perimental design. As the jacket blurb states, “although . . . mathe- 
matical proofs are not presented, the logic basic to such proofs is 
made explicit.” That is, Winer makes clear the mathematical bases 
for the statistical techniques used in experimental design. This point 
should be underscored because even though the statistical logic is 
there, in abundant detail, the logic of experimental design is, by 
and large, missing. It is this emphasis on the statistical which over- 
whelms the book. 

Experimental design lies on a fringe area between statistics and 
the substantive area to which it is applied. As such, it partakes of 
both. Certainly, given the relevant mathematical conditions, an experimental
design can be shown to be statistically optimum in a
given situation. However, it is the substantive area and the particular
application which determine these conditions. Therefore, if one
does not desire to give just a compendium of designs or to write a 
text which is to be used in all areas of science, one should make very 
clear the special problems arising out of the area of application and 
attempt to indicate their solutions explicitly. Since apparently this 
book is aimed at psychologists, it ought to treat the special design 
problems of psychology.

As one can see from perusing the chapter titles, Winer covers re- 
peated measurements, nested designs, quasi-F-ratios, incomplete 
designs, covariance, non-parametric techniques, and many other 
topics. From these headings it is obvious that the author treats many 
of the special problems of psychology. What he does not do is to 
tell the reader why these topics are important. Obviously the reader 
needs to know under what conditions these problems arise so that 
he may judge when a particular design may be put to good use. 

Basically what is given is a set of designs and their statistical 
bases (as indeed the title leads one to expect). In addition to this,
the experimenter needs someone standing over his shoulder saying




such things as “that design is not robust with respect to these as- 
sumptions and we should therefore take the following new factors 
into account." 

In addition to this major point, the book has many small errors 
characteristic of a first edition. 

Some of these errors take the form of ambiguity (p. 14), incon- 
sistency of text and figures (p. 29), use of terms without adequate 
definition (expectation operator, p. 40), inconsistency between text 
and index (Scheffé’s classic dated 1959 in text, 1960 in references), 
and similar correctable errors. 

Other “errors” take the form of what the reviewer considers to be 
misplacement of emphasis. The author adequately covers the con- 
cept of a confidence interval (p. 22) in the usual Neyman-Pearson 
sense but disregards the highly respectable Bayesian interpretation. 
On pp. 95-96, the author talks about Bartlett’s and related tests of 
heterogeneity of variance without emphasizing just how very much 
they are sensitive to non-normality. He even quotes Box’s famous 
“rowboat analogy” (p. 219) without seeming to relate it directly 
to his previous statements. 

Finally there are the near and actual errors and misconceptions. 
An example of this is the expression on p. 61, that 


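\[ E(F) = \frac{E(MS_{\text{treat}})}{E(MS_{\text{res}})} = \frac{\sigma_\varepsilon^2 + n\sigma_\tau^2}{\sigma_\varepsilon^2}. \]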

There is a statement that the mean squares need to be independent 
for the expression to be valid. In actuality this expression is just not 
exactly true under any conditions, because the expectation of a ratio 
is not the ratio of the expectations. 
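
The reviewer's point, that the expectation of a ratio is not the ratio of the expectations, can be checked numerically. The sketch below is an illustration only and is not taken from Winer's text: it assumes the null case with unit variance, draws two independent mean squares as scaled chi-square variables, and compares the average of their ratio with the ratio of their averages. For an F ratio with more than 2 denominator degrees of freedom the expected value is known to be df_den / (df_den − 2), which exceeds 1 even though both mean squares have expectation 1. The function name and the sample sizes are arbitrary choices.

```python
import random


def simulate_f_ratio(df_num, df_den, n_draws=200_000, seed=0):
    """Return (mean of MS_num/MS_den, mean(MS_num)/mean(MS_den)).

    Each mean square is a chi-square variable divided by its degrees
    of freedom; chi-square(k) is Gamma(shape=k/2, scale=2).
    """
    rng = random.Random(seed)
    sum_ratio = sum_num = sum_den = 0.0
    for _ in range(n_draws):
        ms_num = rng.gammavariate(df_num / 2, 2.0) / df_num
        ms_den = rng.gammavariate(df_den / 2, 2.0) / df_den
        sum_ratio += ms_num / ms_den
        sum_num += ms_num
        sum_den += ms_den
    mean_of_ratio = sum_ratio / n_draws
    ratio_of_means = (sum_num / n_draws) / (sum_den / n_draws)
    return mean_of_ratio, ratio_of_means


if __name__ == "__main__":
    mean_of_ratio, ratio_of_means = simulate_f_ratio(df_num=3, df_den=10)
    print("mean of ratio :", round(mean_of_ratio, 3))   # theory: 10 / 8 = 1.25
    print("ratio of means:", round(ratio_of_means, 3))  # theory: 1.0
```

With 3 and 10 degrees of freedom the two quantities differ by about a quarter, and independence of the mean squares does nothing to close the gap; the discrepancy shrinks only as the denominator degrees of freedom grow.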

In his discussion of the studentized range statistics (p. 81), the
author gives the impression that he does not realize that the distribution
of the studentized range statistic changes depending on
which portion of the ordered sample yields the difference, e.g.,
$E(y_{(n)} - y_{(n-1)}) > E(y_{([n/2])} - y_{([n/2]-1)})$, where the brackets indicate
the greatest integer function and the parentheses indicate the
order statistics.

On p. 118 he talks about a quantity which he symbolized as
$\sigma^2_{\tau_1 - \tau_2}$; by this he presumably means $E[(T_1 - T_2)/2]$. In the manipulation
of these symbols he talks about the possible correlation
between τ and ε, but τ is a constant and so there can be no correlation
between the two quantities. The author may mislead some
of his audience into thinking that τ is a random variable.

In spite of these relatively minor shortcomings, the book is, in
many respects, the best text on this topic yet to appear. The author
gives many pedagogically excellent, correct explanations. He makes
clear the distinctions between the fixed-effects and random-effects
models (p. 62). He gives an excellent discussion of nested designs




(pp. 184—191). And he gives a good explanation of repeated-meas- 
urements experiments (pp. 116-124). 

The book is, in this reviewer's opinion, the most technically correct 
text on experimental design for psychologists ever written. How- 
ever, for those interested in applying experimental design to their 
area of psychology, it will need extensive supplementation in the 
logic of experimental design. 

In its area, this book seems likely to become a landmark.

DAVID E. WILEY
Department of Educational Psychology 
University of Wisconsin 


An Introduction to Experimental Design by William S. Ray. New 
York: Macmillan Company, 1960. Pp. x + 254. 

Ray has written a textbook for the beginning graduate student 
in which the principles of experimental design in psychology are 
introduced and discussed. Few sources exist in which basic topics 
of experimental design are dealt with at such length or with such
clarity.

In the first of seventeen chapters Ray examines the motivation
and rationale of the analysis of variability. From the definitions of
dependent and independent experimental variables in Chapter 2 on 
through “incorrect decisions” and the assumptions underlying the 
analysis of variance in Chapters 7 and 8, the development of the 
subject is logical and coherent. Chapters 9 through 13 are concerned
with specific classes of experimental designs, viz., matching designs, 
analyses of covariance, and factorial designs (with matching and 
with adjusting). Chapters 14 and 15 are entitled “Specific Comparisons”
and “Specific Comparisons in Factorial Designs” and deal in
an elementary way with curve fitting. In Chapter 16 the methods
previously outlined are extended to designs with two or more supplies
of subjects and to those with randomly selected factor levels.
In “Specific Problems,” the last chapter, Ray examines the relative
efficacies of different designs, the analysis of non-orthogonal designs, 
and the problem of fallacious matching. 

Whoever proposes to write a textbook faces a dilemma: in order 
that his work be a worthy contribution, first it must provide for the 
acquisition of the material by the reader; and second it should serve
as a useful reference work. These criteria would seem too idealistic
if it were not that psychology and statistics have some notably
successful texts.

With respect to the first criterion, An Introduction to Experimental
Design is quite successful. The mathematics in the text is
very simple; the neophyte with high verbal and quantitative ability
will have little difficulty following the discussions. The emphasis
throughout the book is on the “common-sense” approach to experimentation
and its results. Certainly, the more experienced reader
will find that the subject takes on a “common-sense” aspect in Ray's 
discerning and lucid exposition. Two of the more salient successes 
are the discussion of incorrect decisions and the development of the 
analysis of covariance. Ray has made a calculated attempt to in- 
struct his audience, and he has accomplished his objective well. 

The didactic function of the text could be improved at a few 
points, however. Lamentably, the entire book contains just one prob- 
lem to be worked by the student. Although only minor objections 
can be raised against the text—e.g., the assumptions underlying the 
analysis of covariance are never clearly stated—Ray does lead the 
reader astray at one point. Repeatedly in the construction of theory 
and in the reporting of experimental results, the conclusion to
“accept the null hypothesis” is made. Undoubtedly this is intended to
be a provisional acceptance contingent upon further experimental 
evidence. The ingenuous reader, that person for whom the book is 
intended, will probably interpret such a conclusion to be that ab- 
solutely no difference exists between the parameters in question— 
that the effect under test in an analysis of variance equals zero. The 
conclusion “do not reject” would seem more appropriate. 

In spite of its numerous strong points, An Introduction to Experi- 
mental Design is speared on the other horn of the textbook writer's 
dilemma. As a reference work, the book is inadequate. Anyone seek- 
ing a clear conceptualization of the choice of error term in the analy- 
sis of variance (via average values of mean squares, finite models, 
variance components, etc.) will be frustrated. The problem of the 
estimation of parameters has not been dealt with, the emphasis 
throughout being on testing hypotheses. Treatment of multiple com- 
parisons in analyses of variance is dismissed as perhaps being “too 
difficult for the beginning student” (p. 63). Moreover, the book has 
a poor 1¼-page subject index and no author index or over-all bibliography.
The three tables in the appendix are almost unbelievably
sketchy. For example, only the 95th percentile points of the χ² distribution
are tabulated, and these just for degrees of freedom 1 to 10.

It is apparent that Ray did not intend to accommodate those who
want both a pedagogical instrument and a reference work under a
single cover. An Introduction to Experimental Design would prob- 
ably not be adequate as the sole text for a course. However, if it is 
used as a supplementary text, we know of few other books which 
seem likely to serve its stated purpose as well. 

GENE V. GLASS and JULIAN C. STANLEY
Department of Educational Psychology 
University of Wisconsin 


The Q-Sort Method in Personality Assessment and Psychiatric Re- 
search by Jack Block. Springfield, Illinois: Charles C Thomas 
Company, 1961. Pp. ix + 161. $6.75. 




In 1953 William Stephenson published The Study of Behavior in 
which he described the Q-method of factor analysis and some of its
applications to the psychological investigation of personality. Although
the book was well received, it resulted in relatively few disciples.
However, one of these disciples, the author of this present
monograph, may deservedly achieve a greater degree of success than
has the master. In this slim volume Block lucidly presents the details
of the Q-sort procedure and the advantages of applying this
method to the study of personality types and psychiatric classifications.
The work is not, nor was it intended to be, a substitute for the
more basic writings of Stephenson, Cattell, Cronbach, and others. It
is a monograph, and as such is primarily concerned with a description
and discussion of the special Q-sort procedure designated as the
California Q-set.

Block began to develop the Q-sort items for assessing personality 
variables about eight years ago. The present set of 100 items is the
third revision. It is obvious that a great deal of thought, judgment, 
and analysis influenced the selection. The complete list of items is in- 
cluded in Appendix A, and the suggested instructions for their use 
are recorded in Appendix B. In essence the rater is instructed to sort 
the items into a specified 9-point unimodal distribution. Those items 
which are most characteristic of the subject being described are 
rated as 9 on the scale, and the items rated as being least characteristic
are assigned to rank one.

The advantage of the Q-sort method rests primarily on the fact 
that it provides a standard, albeit restricted, vocabulary and pro- 
cedure for describing personality characteristics. The method can 
be used by different raters to describe a single person or by a single 
rater to describe many individuals. Block makes the important point
that, when used to rate, describe, or compare individuals and groups,
the Q-sort procedure can be independent of the Q, or person-to-person,
factor analysis technique. Q-factor analysis is only one of
four research applications described by Block. The other three are: 


1) The comparison of item placements in one Q-sort with item
   placements in another Q-sort.
2) The comparison of Q-item placements in one group of individuals
   with the Q-item placements in another group of individuals.
3) The correspondence of Q-sorts with a standard or criterion
   Q-sort.
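
The first and third of these applications reduce to comparing two sets of item placements, which is commonly done with a product-moment correlation across items. The sketch below is a minimal illustration under stated assumptions and is not Block's own procedure as printed: the forced distribution of 5, 8, 12, 16, 18, 16, 12, 8, 5 items over the nine categories is the one usually cited for the 100-item California Q-set, but it is not given in this review and should be treated as an assumption, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt

# Assumed forced distribution: category (1 = least characteristic,
# 9 = most characteristic) -> number of items placed there.
FORCED_COUNTS = {1: 5, 2: 8, 3: 12, 4: 16, 5: 18, 6: 16, 7: 12, 8: 8, 9: 5}


def conforms_to_forced_distribution(sort):
    """True if the sort places exactly the prescribed number of items
    in each of the nine categories."""
    return dict(Counter(sort)) == FORCED_COUNTS


def q_correlation(sort_a, sort_b):
    """Pearson correlation between two sorts of the same items,
    where position i in each list is the category given to item i."""
    n = len(sort_a)
    mean_a = sum(sort_a) / n
    mean_b = sum(sort_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(sort_a, sort_b))
    var_a = sum((a - mean_a) ** 2 for a in sort_a)
    var_b = sum((b - mean_b) ** 2 for b in sort_b)
    return cov / sqrt(var_a * var_b)


if __name__ == "__main__":
    # A made-up sort: items listed in order of the category assigned.
    rater_1 = [cat for cat, k in FORCED_COUNTS.items() for _ in range(k)]
    # A second made-up sort that reverses every placement.
    rater_2 = [10 - cat for cat in rater_1]
    print(conforms_to_forced_distribution(rater_1))
    print(q_correlation(rater_1, rater_1), q_correlation(rater_1, rater_2))
```

Because every sort is forced into the same distribution, all sorts share the same mean and variance, so the correlation depends only on how the items are arranged. Comparing a sort with a criterion sort (application 3) is the same computation with the criterion as the second argument.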


By describing the methods and rationale of these procedures,
Block has increased and updated our knowledge of the technique.
By pointing out the kinds of research applications in which the Q-sort
is appropriate, he has extended the range of its usefulness.
Through having developed a comprehensive set of 100 items, he has
made the California Q-set readily available for personality research.




He has accomplished a great deal, and the monograph is recom- 
mended to those researchers who are interested in exploring the ap- 
plicability of the Q-sort method for personality assessment. 

HAROLD BORKO

System Development Corporation 


Personality in Social Process: Values and Strategies of Individuals 
in a Free Society by V. Clyde Arnspiger. Chicago: Follett Com- 
pany, 1961. 

Education for what? Such was the question sharply and cogently 
put to the profession some decades back in a book by that title 
written by Professor Lynd. To it, there have since been many partial 
answers, but few comprehensive ones. Professor Arnspiger provides 
us at long last with the most workable comprehensive one we have 
yet had. 

Lynd’s question was meant to instigate and clarify value think- 
ing. What overriding objectives ought the American educational 
system have? How could such a general prescription be worded so 
as to be clearly relevant to the concrete decisions that school ad- 
ministrators and teachers have to make from day to day and minute 
to minute? How could such a general prescription be made equally 
relevant and responsive to the other institutions in the environment 
of the school—such as family, church, business and government— 
with their own policy codes and prescriptions? In short, the search 
was for a statement of goals of wide enough significance to be in- 
tegrative of school and society, but of sufficient clarity to be relevant 
to everyday decisions regarding practical educational matters. 

Value thinking has been suffused with certain rather widely shared 
assumptions that have acted as handicaps. The tendency is to see 
facts not as instances of value, nor of acts of valuing as themselves 
facts. Instead, because values involve “prescriptions” or “norms” of 
various sorts, they are divorced from factual matters, the latter be- 
ing reserved for what is rather than what ought to be. In this way, 
fact and value have become bifurcated until it seems hopeless to 
“patch” them together again, but the difficulty is only one of a 
thinker’s own making. It permeates and obscures even the best in- 
tentions in social science today, such as the general theory of action 
of Talcott Parsons. 

In Professor Arnspiger’s framework, values can be described as 
facts. To say that X values Y is to assert a factual proposition that 
can be tested quite apart from who makes the statement. We simply 
observe X’s practices in relationship to Y and employ all the available
and devisable measurements of goal seeking. So too, all facts
about social practices and behavior patterns can be examined in 
terms of the goals or objectives sought and/or realized through them. 
When we discover regularities in goal seeking, we can speak of
"norms"; and we can pay attention to the "ought" statements of those
engaged in the practice. Although people may tell us what their 
norms are, they are not supplying us with any kind of data that is in 
principle of a different order from their other behavior. Their state- 
ments, while data, can be tested apart from their mere utterance. 
Frequently, their "ought" and their other statements or practices
conflict. In fact, this is one of the most important features of the 
social psychology of "insight." 

If we were to list all the specific goals that people pursue, our set 
of categories would be so endless as to be unmanageable. The task, 
then, is to find a few abstract categories under which all these
specific goals can be subsumed. The abstract categories must be 
comprehensive, so that no specific goal is left sticking out some- 
where by itself. Each category should, furthermore, be of an equal 
level of abstraction. Professor Arnspiger turns here to the scheme 
developed by Harold D. Lasswell at Yale in which eight categories 
are basic, each one derived from an institution that has persisted 
throughout the course of recorded history and which tends to be 
found in all cultures. The idea consists in this principle: whenever 
men have institutionalized a specialization to a value in so recurring 
a way, there must be something of what philosophers like Whitehead
called "eternal significance" to that value. In any event, the
important point is that the categories are derived from empirical
considerations of relative permanence in history and appearance
across cultures. 

Although abstract, the resulting eight categories are to be con- 
sidered descriptions of what men value, of their goals. A particular 
goal does not exhaust the value, but is merely an indication or ex- 
emplification of it. How each specific goal is to be classified is to be 
dealt with like any other empirical problem of coding, although the 
most important thing about the descriptive framework is that it
compels any personality or group in any situation, however concrete,
to clarify conflict and choice of goals by both (1) analytical coding
of the specific issues that are under concern, as well as (2) seeing
these issues in terms that transcend, while they are exemplified in,
the ongoing situation. Science is an attempt to learn from experience,
acquire insight, understanding and, as a result, a greater creativity
and innovation. When we can see that this value conflict here before
us is similar to others around us (and to still others that have occurred
in the past), we are liberated from resolving each situation
we come to in ad hoc over-personalized terms. We are also forced
to become more self-consistent, since when we see our situation as
not too different from those in which others have been, we are less
likely to prescribe in one way for others and in another way for
ourselves. In short, thinking abstractly about goals allows us to see
our particular situations of value conflict and choice with the breadth




characteristic of science, since we are required to see it more gen- 
erally and comprehensively, not only in immediate personal and 
emotional terms. We might speak of this as putting the self in social 
context and process," in the wider stream of valuing coming out of 
the past, going on in the present, and to which we can contribute in 
the future by choices of goals we make now. We are, therefore, un- 
likely to be “alienated” from our society, by seeing our own capacity 
for intelligently intervening in it. The same value problems are be- 
ing faced all over the social map, although their details may differ 
in each instance. No one is quite alone or unique irrespective of what 
his special value problems are. 

Part of the answer, then, to Professor Lynd's question, is that there 
is now a descriptive science of values which may be taught and is 
being taught in schools throughout the country, from primary grades 
to graduate school levels. The aim is to use education primarily to 
help pre-adults master more scientific ways of thinking about value 
problems. Needless to say, the teaching job and the learning task is 
not merely a matter of learning how to use words. It is also, and 
primarily, one of learning how to apply them to learning itself, to 
all the interpersonal relations into which the school enters. 

Nonetheless, mere description, however edifying if done systemati- 
cally and comprehensively, does not by itself provide one with a 
necessarily intelligent prescription. There is probably no institution 
that has listened and talked more repetitiously about itself as being 
part of a democratic order, especially in America, than the school. 
The concept of democracy among educators has been a bit fuzzy 
and unfocused; the attempts to apply it to day-to-day procedure, 
consequently, have often resulted in the same kind of rigid unthink- 
ing rituals, however permissive, that the talk was meant to supplant. 

Professor Arnspiger builds upon the movement that John Dewey 
instigated, but he gives it a new clarity. The overriding objective 
that democracy prescribes is, in his language, “human dignity.” This 
prescription, this “ought,” is now related to the descriptive value 
categories, so that it can be operationally used in concrete circumstances
in descriptive ways. Human dignity is defined as a state
where no participant pursues his value objective in such a way as
seriously and severely to overdeprive or to overindulge the pursuit
of the value objectives of others in the process. The teacher who
showers one child in the class with indulgences denied the others
because that child has a father who might be in a position to get her
a salary raise is engaged in a practice not conducive to human dignity.
The teacher who continually nags and humiliates one particular
child in the class because the father of that child voted against a
raise of salary for her is engaged in a practice not conducive to human
dignity. While the concept of human dignity is, therefore, an
“ought,” a prescription, a “norm,” and all of these characteristics at


BOOK REVIEWS 213 


` avery abstract level, there is no difficulty in principle in operational- 
izing it through examining classroom practices (or any practices 
merely “discussed” in the classroom) in terms of what objectives 
such practices actuate as outcomes. Hight value categories are pro- 
posed. Four of these refer to welfare or “material” values: well- 
being, enlightenment, skill, wealth. Four of them refer to deference 
or “attitudinal” values: respect, affection, rectitude, power. Each 
is given the essential defining characteristics. It is the interplay of 
all, or any part of these, in the interpersonal acts occurring in the
school that gives us a description of imbalance, slight or severe, that 
allows us to make a factual statement about the degree to which the 
democratic prescription has been realized or abrogated. Hard analytical
uphill work is involved—not snap judgments, quick impressions,
or elegant literary intuitions about how to achieve democracy.

There is one feature of this general theoretical model which de- 
serves marked emphasis. We have often heard of “democratic values" 
—uttered, one may add, not always without sentimental and pious 
overtones. Democracy, in the Arnspiger framework, has to do with 
a balance or equilibrium among values, not with one set of values
considered in absolute terms apart from the relationships into which
they enter with the rest in a special instance or in a class of instances.
The freedom to pursue any value is part and parcel of a democracy,
no matter how “new” or “strange” the strategy or goal may seem. 
Democracy does enter in, and then descriptively as well as pre- 
scriptively in terms of relationships among values. There may be 
undemocratic personalities not only in the sense of those who prove 
costly or disruptive to others to have around, but also in the sense 
that they pursue one value at great costs to other values for them- 
selves. We may remark, too, that the pursuit of power is a perfectly 
democratic value, in the sense that a wish to play a larger share in 
setting or applying norms to govern the actions of others is the goal, 
but only when it is not pursued at great costs of imbalances to the 
self and to others in terms of all other values. Once such power is 
attained, it may be used democratically as long as it is employed to 
reduce imbalance rather than contribute further to it in total outcome.

There is a great deal more to be found in Professor Arnspiger’s
book than what has been purveyed. On the theoretical level, there
is an especially enlightening presentation of the difference between
“realistic” and “unrealistic” thinking, relating the various mechanisms
and strategies of each type of thinking to the more general
value theory. The book abounds, too, with empirical material that
comes from students who have attempted to apply the value theory
to their own value problems as personalities or as members of groups,
as well as to the more penetrating understanding of both classic and
modern literature. There is an especially interesting section in which
students set difficult objectives for themselves in their extracurricular 
lives and report how they go about realizing them through the use 
of the value framework. Some of these objectives deal with the at- 
tempt to “undo” the personality damage that has occurred to de- 
prived persons among them. The dynamics of these detailed attempts 
to “intervene” in events so as to right prior imbalances, with all their 
rocky dramatic course to success, give signs of how a mind systematically
trained in social science can easily outdo our current public
media writers in making “stories” or “case histories” as instructive 
and thought-provoking as they are aesthetically compelling. One of 
the points stressed by Professor Arnspiger is that each of us under- 
estimates his capacity to supply values to others, and sometimes be- 
cause of this such resources go unused. When his students begin to 
realize this idea, the resources of their personalities are often freed 
for more effective interpersonal use. 

Why should a specialist in measurement within the social sciences 
concern himself with an operational theory of values, such as Pro- 
fessor Arnspiger has provided? The answer lies in the challenge and 
opportunity it provides to stretch existing measurements toward new 
frontiers and to improvise creative quantitative techniques that may 
be required. While doing so, measurement has a chance to become 
more relevant to what social behavior is ultimately about—the at- 
tainment of values. Measurement is itself a value—the attainment 
of more precision in skill. When that attainment is purchased with- 
out regard to other values, it becomes skill for skill’s sake alone—a 
policy that requires the neglect of other goals. Such kinds of pre- 
cision, which often prove fallacious, can be maintained only by cur- 
tailing rigorous study of all values. Research piles up selectively, 
however unwittingly, around what is most readily measured. Matters 
of comfort and security that constitute well-being come, along with 
wealth, in for the heaviest share of quantification instead of what 
most needs to be more rigorously measured—areas like insight and 
rectitude and affection. The practices of the measurement profession
as a whole thus give selective inattention and selective over-attention
to different areas of valuation. Since society is increasingly
being guided by scientific information, there are wider social ramifications
stemming from such selective measurement practices. Quantitative
skill is developed to permit some values to move ahead (as
far as increments in human insight and understanding are actually 
produced by such skill applications) while allowing others to lag 
behind. Thus, measurement experts can speak in favor of democracy 
in general but subtly subvert it by collective practices that are allowed
to remain in imbalance. Such a “neurotic” pattern of word and
deed is one example of what Professor Arnspiger calls unrealistic 
thinking in contrast to effective or critical intelligence. 




The scientific study of values cannot be confined merely to laws 
about content. These laws about content have also to be applied to 
our procedures. The Robinson Crusoe image that the scientist has 
tended to have of himself might have been an excusable delusion in 
the days when it was not yet glaringly apparent how much science, 
as a pursuit of enlightenment, could modify the value of the condi- 
tions in which it arose. The power of enlightenment is at last stagger- 
ingly apparent, and although social science still has a long way to 
go until it produces the equivalence of an atom bomb or a rocket to 
the moon, there is no doubt that such equivalent discoveries are be- 
fore us and that the seeds of them have already been intellectually 
planted among us. The sooner the expert on social psychological 
measurements and the expert on values come to an effective work- 
ing relationship, the sooner we can get on to the major discoveries 
ahead. Let us not merely study “achievement” as a skill. Let us ask: 
“achievement of what objective?” And follow it by: “achievement 
with what costs or gain to other achievements affected by it?" Let 
us enlarge and not constrict our map so that the norms of what we 
do may become the norms we most profess. This is what Professor 
Arnspiger calls out to us to accomplish. 

ARTHUR J. BRODBECK
Psychology Department
Wesleyan University
and
Yale Law School


Psychological Differentiation by H. A. Witkin, R. B. Dyk, H. F.
Faterson, D. R. Goodenough, and S. A. Karp. New York: John
Wiley & Sons, Inc., 1962. Pp. xii + 418. $7.95.

A new dimension of individuality, its related theoretical structure,
and supporting research are presented by Witkin and his associates.
The dimension—psychological differentiation—is a continuum
representing divergent directions of development. Lying at one end
of the continuum, the highly differentiated person manifests the
ability to be field-independent in visual and proprioceptive perception
and in his personal-social life. He has a greater sense of identity,
associated with a clearly defined body concept and social self-sufficiency
and aloofness. His approach to life and problem solving
is analytical. Differentiation implies complexity and heterogeneity
of a system's organization. While not necessarily effective and mature,
the controls and defenses of the differentiated person are channelized
and structured.

Located at the other end of the continuum, the field-dependent
person approaches life globally. His body concept is less well developed;
he has difficulty in separating visual and proprioceptive
sensations from the context in which they lie. Children with a low
level of differentiation are characterized by dependence, poor impulse
control, little sense of responsibility, poor resources, and lack
of enterprise and of initiative.

Although the characteristics at the low end of the continuum appear
less desirable than their counterparts at the other end, differentiation
is not related to effectiveness of integration nor to degree
of psychopathology. Relationships are found, on the other hand, between
differentiation and complexity of integration as well as direction
of psychopathology. Cognitive and personality characteristics
show definite associations with degree of differentiation. Many sig- 
nificant positive correlations have been obtained between perceptual 
scores and intelligence quotients, performance items correlating more 
closely than verbal items. Field-independent children, however, do 
not necessarily score higher on intelligence tests. 

Witkin and his colleagues explored and reported on the relation- 
ship of differentiation with many additional areas of life, includ- 
ing sexuality, activity, self-consistency, and stability of the dimen- 
sion. They cite an extensive body of convincing research from their 
own laboratories and those of others. 

Differentiation, the authors theorize, results from articulation of 
experience—the analyzing and organizing of stimuli into meaning. 
In this way the individual becomes an autonomous agent, separate 
and apart from the field. Seeking determinants of developmental 
direction in the interaction of the child with his mother, the investigators
support their hypothesis with data from interviews with
mothers, but they overlook the neurophysiological organization which 
the child brings to the situation and to which the mothers reacted. 

The possible role of neurophysiological processes in developing
differentiation receives only brief allusion from Witkin and collabo- 
rators. They seem not to have recognized the full significance of the 
fact that their theoretical structure is built largely on the scores 
from three tests, two of which required the correct interpretation 
of proprioceptive stimuli, especially from the vestibular system. 
Linking mode of perception to other personal characteristics, they 
interpret the scores as merely a reflection of perceptual preference.
This point is of particular interest in view of the procedures used
by various people throughout the country to facilitate the development
of perception, body concept, individual identity—in fact, differentiation—in
children with neurological dysfunction. These procedures
often involve eliciting proprioceptive discharge, especially of
the vestibular system. It seems highly possible that the perception 
of vestibular and somatic stimuli provides a foundation which par- 
ticipates in the determination of direction of development, rather 
than one which merely reflects a mode of response primarily determined
by parental interaction.

For reasons made clear by Witkin and fellow investigators, the
concept of differentiation can be of value in diagnosis and understanding
of psychopathology; for reasons made less clear, the book
offers much to the educator and clinician concerned with perceptual- 
motor dysfunction and certain types of learning problems. The study 
suggests exciting possibilities for helpful measurement of selected 
areas of intellectual and affective functioning. Instead of obtaining 
a quantitative score which is an end in itself, it may soon be possible 
and desirable to state the areas of neuro-physiological function 
which need to be activated to develop lagging cognitive and emo- 
tional development. 

A. JEAN AYRES
Department of Occupational Therapy
University of Southern California


Understanding Educational Research by Deobold B. Van Dalen and 
William J. Meyer. New York: McGraw-Hill Book Company,
1962. Pp. xi + 432. 

Understanding Educational Research intends to “acquaint readers
with the goals, basic assumptions, limitations, and language of
scientists—with the way researchers talk and how their minds work
in getting results.” The book is written expressly for mature upperclassmen
and graduate students pursuing a master's degree in education.
The emphasis throughout is on the scientific method and how it
is applied in educational research.

In the first of sixteen chapters Van Dalen examines research as it 
relates to social progress. He waxes philosophical in the next three 
chapters, which treat methods of acquiring knowledge, general con- 
cepts concerning the scientific method, and the nature of observa- 
tion. Chapters 5 and 6, “Printed Resources for Problem Solving” 
and “Library Skills for Problem Solving,” stand alone. One chapter 
each on the analysis of the problem and the solution of the problem 
is followed by chapter 9, “Patterns of Historical Research," chapter 
10, "Patterns of Descriptive Research," and chapter 11, “Patterns of 
Experimental Research." 

Beginning with chapter 12, Understanding Educational Research 
assumes a different mien. “Tools of Research” deals briefly with 
sampling, questionnaires, interviews, appraisal instruments, and observation.
William J. Meyer then offers 63 pages, distributed between
chapters 13 and 14, of descriptive and inferential statistics.
The remaining two chapters are entitled “Writing the Research Report”
and “Evaluation and Publication of Research.” Statistical
tables and seven appendices illustrating the use of the scientific 
method complete the text. 

Though better than most of its predecessors, Understanding Educational
Research is symptomatic of a tendency for schools of education
in large universities to become increasingly “inbred” and
segregated from the rest of the institution. Either because their
students cannot meet the requirements of other departments or 
because they feel that they have particular needs which cannot be 
met by courses extant, departments of education are attempting 
to teach those subjects which often might be taught more ably 
by others on the same campus. Statistics courses in departments 
of education and psychology meet needs not met by statistics 
courses in mathematics departments. On the other hand, little of 
what Van Dalen presents in the first four chapters could not be 
acquired in an elementary logic course in the philosophy department.
The danger in establishing courses whose objectives might
be met better elsewhere is that students will be exposed to a less
comprehensive point of view and remain ignorant of the way in
which the majority of scholars actually “talk.” The text can, how- 
ever, be evaluated apart from the more important question of 
whether or not the course for which it was prepared is worthwhile. 

Considering the fact that this book was written for mature upper- 
classmen and graduate students, one encounters some surprisingly 
elementary suggestions. For example, on page 98 the authors warn 
that “reading without needed glasses, in insufficient light, . . . is un- 
productive. To achieve success in research, you must schedule sensi- 
ble working hours; establish time-saving routines; get adequate 
food, rest, relaxation, and medical attention. . . .” On page 103 the 
mature upperclassman is instructed to write his notes legibly, carefully
forming each letter and figure, and in ink to avoid smearing.
Van Dalen earlier suggests that the researcher find a place free of 
interruptions and distractions when wanting to read. Obviously no 
stone has been left unturned. 

The 63 pages on descriptive and inferential statistics are another 
valiant attempt at the impossible goal of giving the novice a helpful
“feel” for statistics quickly. Meyer might have come as close to
succeeding as a few others have done, were it not for numerous 
errors and inconsistencies. Some of these are the following: 


1. p. 285: The statement is made that multiplying each score in
a set of scores by a constant increases the standard deviation of the
scores by the constant. Does this mean multiplying the standard
deviation by that constant or merely adding that constant to the
standard deviation?

2. p. 294: Variance is spoken of in its relationship to the Pearson
product-moment correlation coefficient without the term’s having
been defined or even mentioned previously.

3. p. 312: A 95% confidence interval is constructed around a
sample mean. The statement is then made that 95% of the subsequent
sample means will lie within the confidence limits. This is
usually incorrect. For example, if the obtained mean lay two standard
deviations from the population mean in the sampling distribution
of the means, then 50% of the subsequent means would lie outside
of the limits established around the obtained mean.

4. p. 330-335: In the discussion of $\chi^2$, degrees of freedom are denoted
by d.f. In the table of chi-square in Appendix A, n denotes
degrees of freedom. This inconsistency is never brought to the read- 
er’s attention. 
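
Points 1 and 3 above can be checked numerically. The following is only a minimal sketch in Python, not anything taken from the book under review: the scores are made-up illustrative data, and the confidence-interval part assumes a normal sampling distribution with a known standard error. The names `capture_rate` and `average_capture` are hypothetical helpers introduced here.

```python
from math import sqrt
from statistics import NormalDist, pstdev

# Point 1: multiplying every score by a constant c multiplies the
# standard deviation by c; it does not add c to it.
scores = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data, not from the text
c = 3
sd_original = pstdev(scores)
sd_scaled = pstdev([c * s for s in scores])

# Point 3: share of later sample means that fall inside a 95% interval
# centred on an obtained mean lying `offset` standard errors from the
# population mean (assumes a normal sampling distribution).
Z = NormalDist()
CRIT = Z.inv_cdf(0.975)             # about 1.96

def capture_rate(offset):
    return Z.cdf(offset + CRIT) - Z.cdf(offset - CRIT)

# Averaged over all possible obtained means: the difference between two
# independent sample means has a standard error sqrt(2) times as large.
average_capture = 2 * Z.cdf(CRIT / sqrt(2)) - 1

if __name__ == "__main__":
    print(sd_original, sd_scaled)
    print(round(capture_rate(0.0), 3), round(capture_rate(2.0), 3))
    print(round(average_capture, 3))
```

Under these assumptions the capture rate equals 95% only when the obtained mean happens to coincide with the population mean; it falls to just under one half when the obtained mean lies two standard errors away, which is the reviewers' example, and it averages well below 95% over repeated samples.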


Not all of the shortcomings of Understanding Educational Re- 
search are as trivial as the last one. For example, at one point in 
chapter 13 the reader is exposed briefly to partial correlation and 
multiple regression, while in chapters 5 and 6 the most elementary 
aspects of research are spelled out for him. The most serious short- 
coming of the text, however, is that its scope is too broad. The au- 
thors have attempted to condense four semesters of course work in 
statistics, measurement, and related areas into a single course. 

One question looms above Van Dalen’s text. Should the objectives 
of the text be those of a single separate course within a school of 
education? If this question is answered in the affirmative, there is 
still doubt in our minds as to whether or not Understanding Educational
Research can attain these objectives; but it is a worthy competitor
to the other elementary research methods textbooks presently
available. 

GENE V. GLASS AND JULIAN C. STANLEY
Department of Educational Psychology 
University of Wisconsin 


Computer Language, An Autoinstructional Introduction to Fortran 
by Harry L. Colman and Clarence Smallwood. New York: 
McGraw-Hill Book Company, 1962. Pp. xiv + 196. $3.95 text 
edition; $5.95 trade edition. 

Only a few years ago it would have been unreasonable to expect 
psychologists to write computer programs to take care of their com- 
putational needs. Some did, but it took an appreciable time to be- 
come proficient in the art of programming, and the task itself was 
extremely time-consuming. With the advent of automatic programming
(the computer generation of machine-oriented statements from
a smaller number of user-oriented statements), the situation has
changed. Programming in machine language, either numerical or
symbolic, is still difficult and time-consuming; programming in a
user-oriented language such as Fortran is comparatively easy in
terms of both the learning time and the time required to produce a
running program. With the larger and faster computers, programming
in such a language is efficient in terms of the programmer's
time and (at least for the nonprofessional) in terms of the computer
time required. This has not been true of small computers such as the
IBM 650, but it is certainly true of the IBM 7090. Fortran, the automatic
programming language of the IBM computers and many
others, is undoubtedly more widely used today than any other computer
language. A text which “constitutes a self-contained training
system for introducing students of almost any background or professional
interest to the details of Fortran coding and the art of
computer programming” would be welcome indeed. Unfortunately,
Computer Language cannot be said to achieve this desirable but
difficult objective.

It is maintained that the teaching technique used is based on the
work of B. F. Skinner and related teaching machine applications.
The novel aspect is that there is no overt response associated with
new units of information; nor is there any verification. It is stated
that this does not decrease the effectiveness of the program of instruction.
No references are given; nor has the reviewer been able
to obtain documentation of this statement. Elimination of the response
removes the opportunity to branch as a function of the students'
knowledge of the material. This is an important feature of
the teaching machine which has been incorporated into other autoinstructional
texts. Actually there are exercises at several points in
the text, although these are said to be used only as a confidence-building
device. These exercises may have the opposite effect. One
reader on whom the material was tested went through almost the
entire book without making an error on the exercises. This certainly
had a confidence-building effect, although it was obvious that he did
not understand several basic concepts. The problem seems to be not
that the basic concepts are not explained, but rather the absence of
redundancy. Certainly more space should be devoted to the concept
of the “loop” and to “subscripted variables.” In the present form,
the reader without mathematical training may find it difficult to
understand the difference between the variable J and the subscripted
variable X(J).
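
The distinction between J and X(J) may be easier to see in a few lines of code. The sketch below is written in Python rather than Fortran so that it can be run directly; the array contents and the helper names `element` and `total` are hypothetical and come from neither the book nor the review.

```python
# J is a single integer; X(J) is the J-th value stored in the array X.
X = [2.5, 4.0, 7.5, 1.0]   # hypothetical array of four values
J = 3                      # a plain integer variable

def element(array, subscript):
    """Return array(subscript) using Fortran's 1-based counting."""
    return array[subscript - 1]

def total(array):
    """A loop: j steps through 1, 2, ..., N while the sum accumulates X(j)."""
    s = 0.0
    for j in range(1, len(array) + 1):
        s += element(array, j)
    return s

if __name__ == "__main__":
    print(J, element(X, J), total(X))
```

Here J stays an ordinary number that selects a position, while `element(X, J)` is the value found at that position; the loop in `total` changes J on each pass and so visits every element in turn.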

Some readers will find the format of the book irritating while
others will consider it novel and conducive to maintenance of interest.
The material is presented in rectangles with arrows leading
from one rectangle to another. The format of rectangles for each
page was individually designed, providing variety. A particularly
useful device is the scrambling of the answers. The answers to exercises
on page 39, for example, are all on different pages.

Accuracy is extremely important in a text such as this which is intended
for self-study. The proofreading and checking leave much
to be desired. The main programming example has ten errors,
and a number of the errors are definitely of the non-trivial kind.
For example, on page 67 it is stated that the size of an array is a
floating-point number rather than a fixed-point number; on pages
143, 145 and 146, the instructions for “if” loops are such that the
problem would not terminate. Each of them has the form


   43 L=1
      L=L+1
      GO TO 43


These latter errors appear in the diagnosis of the program written 
by the student. This misinformation is bad; what is worse, he is 
never shown the correct way to write the program. 
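
Why the quoted form cannot terminate can be traced step by step. The following Python sketch is an illustration only: the safety cap `max_steps`, the exit value `limit`, and both function names are assumptions added here, and the corrected version is one plausible repair, not the program the book intended.

```python
def run_faulty(max_steps=5):
    """Mimic '43 L=1 / L=L+1 / GO TO 43': the jump lands on the reset."""
    trace = []
    for _ in range(max_steps):   # safety cap; the Fortran form has no exit
        L = 1                    # statement 43: counter reset on every pass
        L = L + 1                # increment
        trace.append(L)          # GO TO 43 follows, so L never exceeds 2
    return trace

def run_corrected(limit=5):
    """Initialise once, repeat only the increment, and test for exit."""
    L = 1
    trace = [L]
    while L < limit:             # plays the role of the IF test
        L = L + 1
        trace.append(L)
    return trace

if __name__ == "__main__":
    print(run_faulty())
    print(run_corrected())
```

In the faulty form the counter is set back to 1 on every pass, so no test on L could ever succeed; in the repaired form the initialisation lies outside the repeated part and an explicit test ends the loop.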

In view of many defects this text cannot be recommended; however,
considering the paucity of books on this subject, a second edition
might be of value.

ELLIOT M. CRAMER
Biometric Laboratory
The George Washington University


An Introduction to the Calculus of Finite Differences and Difference 
Equations by Kenneth S. Miller. New York: Henry Holt & Com- 
pany, 1960. Pp. viii + 167. $4.50. 

An introductory text on finite differences can be aimed at several 
quite different audiences. Samuel Goldberg’s Introduction to Differ- 
ence Equations, previously reviewed here, was written for social 
scientists and provided a suitably elementary treatment. Kenneth 
Miller has provided an introductory text for mathematicians with 
much greater demands upon the reader. The level corresponds 
roughly to that of a mathematics major in his senior year. For easy 
reading, somewhat more background is required. 

A short introductory chapter develops the basic operators and 
functions related to finite differences. The analogy between difference 
and differential operators is carried farther than in most other texts, 
and factorial polynomials and Stirling numbers are shown to have 
the same relationship to the difference calculus that ordinary poly- 
nomials and binomial coefficients bear to the differential calculus. 
The psychologist with training in calculus will find this chapter much 
more readable than the following ones. 
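
The analogy described in this paragraph can be made concrete with a short check. The Python sketch below uses the standard definitions of the factorial (falling) power and of Stirling numbers of the second kind; Miller's own notation and conventions may differ, and the function names are chosen here for illustration.

```python
def falling_factorial(x, n):
    """x^(n) = x (x-1) ... (x-n+1), with x^(0) = 1."""
    result = 1
    for i in range(n):
        result *= x - i
    return result

def forward_difference(f, x):
    """Delta f(x) = f(x+1) - f(x)."""
    return f(x + 1) - f(x)

def stirling_second(n, k):
    """S(n, k): coefficients in x^n = sum over k of S(n, k) x^(k)."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling_second(n - 1, k) + stirling_second(n - 1, k - 1)

if __name__ == "__main__":
    n, x = 4, 7
    # Delta x^(n) = n x^(n-1), mirroring d/dx x^n = n x^(n-1).
    print(forward_difference(lambda t: falling_factorial(t, n), x),
          n * falling_factorial(x, n - 1))
    # Ordinary powers rewritten in factorial powers via Stirling numbers.
    print(x ** n,
          sum(stirling_second(n, k) * falling_factorial(x, k)
              for k in range(n + 1)))
```

The difference operator lowers the degree of a factorial power exactly as differentiation lowers the degree of an ordinary power, and the Stirling numbers convert between the two kinds of power.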

A second chapter on the representation of functions by the product
of an infinite number of terms is essentially independent of the
rest of the book. Chapter three presents a well motivated account of
Bernoulli numbers and polynomials as related to the earlier material
on finite differences. This leads to the derivation of the Euler-Maclaurin
sum formula and Stirling's formula for the factorial. A
particularly elegant closed form is given for summations of the form
$\sum k^{n}$ in terms of the Bernoulli polynomials. The final and most difficult
chapter deals with the general case of the linear difference equation.
The emphasis is on the existence of solutions and the definition
of linearly independent sets of solutions. The theory is then applied
to the elementary case of linear equations with constant coefficients.
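
A closed form of the kind mentioned for power sums can be verified directly. The sketch below uses the standard identity that the sum of k^n for k = 0, ..., m-1 equals (B_{n+1}(m) - B_{n+1}(0)) / (n + 1), with the convention B_1 = -1/2. Whether Miller states the result in exactly this form is an assumption, and the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(n):
    """B_0..B_n (B_1 = -1/2) from sum_{j=0}^{m} C(m+1, j) B_j = 0, m >= 1."""
    B = [Fraction(1)]
    for m in range(1, n + 1):
        acc = sum(comb(m + 1, j) * B[j] for j in range(m))
        B.append(-acc / (m + 1))
    return B

def bernoulli_poly(n, x):
    """B_n(x) = sum_{j=0}^{n} C(n, j) B_j x^(n-j)."""
    B = bernoulli_numbers(n)
    x = Fraction(x)
    return sum(comb(n, j) * B[j] * x ** (n - j) for j in range(n + 1))

def power_sum(n, m):
    """Sum of k^n for k = 0, ..., m-1, via Bernoulli polynomials."""
    return (bernoulli_poly(n + 1, m) - bernoulli_poly(n + 1, 0)) / (n + 1)

if __name__ == "__main__":
    print(power_sum(2, 4), sum(k ** 2 for k in range(4)))
```

Exact rational arithmetic from `fractions.Fraction` is used so that the comparison with the brute-force sum is not disturbed by rounding.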
It is not to be expected that this text will be widely read by psychologists,
but those with sufficient mathematical background should
find chapter one of particular interest.

ELLIOT M. CRAMER
Biometric Laboratory
The George Washington University


Preparing Objectives for Programmed Instruction by Robert F.
Mager. San Francisco, California: Fearon Publishers, 1961. Pp.
x + 62. $1.75.


One of the most perplexing tasks facing the instructor of an introductory
course in educational measurement is how to teach students
the process of stating instructional objectives in such a manner
that procedures or instruments can be developed or constructed to
ascertain the achievement of those objectives. Students too often
come to an introductory measurements course without ever having
been adequately taught or shown the need for careful behavioral definition
of instructional goals in educational learning situations.
Many current textbooks in measurement do contain sections on this
topic in units dealing with classroom test construction. This reviewer's
experience has been, however, that students often go away
still not fully comprehending the necessity of or possessing the
necessary skills in preparing objectives so that they are measurable.

This book was originally written to facilitate the statement of 
objectives (i.e., terminal behaviors) in preparing programmed in- 
struction materials. The basic nature of the presentation, however, 
is one that pervades education in general and measurement in par- 
ticular. Hence, this book is not only worthwhile for those interested 
in programmed instruction but carries useful knowledge for those 
persons in the field of educational measurement and evaluation. 
Mager has done an outstanding job by presenting in a simplified 
and straightforward manner not only the importance of stating ob- 
jectives explicitly in behavioral terms but also a technique whereby 
this goal can be accomplished. 

The author says that objectives should be meaningfully stated. By 
a meaningful objective, he means “...one that succeeds in communicating
to the reader the writer’s instructional intent. It is
meaningful to the extent it conveys to others a picture (of what a
successful learner will be like) identical to the picture the writer has
in mind.” Various examples of well and poorly stated objectives are
presented to illustrate this point.

What are the procedures by which meaningful statements of
objectives can be accomplished? Through a carefully prepared and
interesting program, Mager presents to the individual three fundamental
considerations in preparing objectives. These are (1) identifying
terminal behavior by name, (2) defining the desired behavior
by describing the conditions under which the behavior will
be expected to occur, and (3) specifying the criteria of acceptable
performance by describing how well the learner must be able to
perform in order for the performance to be considered acceptable.
Examples of how each of these considerations enters into the stating
of objectives are presented in the text through use of a modified
system of intrinsic programming. The reader is not only presented
initially with choices regarding properly stated objectives, but also
given opportunity later to put his learning into practice by being
asked to make a choice between several test items written and said
to measure particularly stated objectives.

As an ultimate criterion of whether an objective clearly defines or
states the desired outcome, Mager says this will be true only when
one can answer yes to the question, “Can another competent person
select successful learners in terms of the objectives so that you, the
objective writer, can agree with the selections?” To make sure that
he has accomplished his objectives, Mager follows his own prescription.
A self-test is presented at the end of the book along with a
criterion of acceptable performance.

Admittedly in such a small book, Mager cannot go into great
detail in tackling the complex problem of defining objectives nor
does he deal with the more subtle and complex objectives such as
appreciations, attitudes, and understandings. He does limit his discussion
mainly to content objectives. While this restriction may be a
weakness of the book, the orientation provided in terms of the fundamental
approach to objective definition and the presentation of a
procedure overcome this limitation in the reviewer’s opinion. This
book will become required reading for students enrolled in the reviewer's
measurements courses. It is suggested that it might be fruitful
for other instructors to follow suit.

DESMOND L. COOK
The Ohio State University


An Introduction to Programed Instruction by William A. Deterline. 
Englewood Cliffs, New Jersey: Prentice-Hall, Inc., 1962. Pp.
131. $3.00.

An Introduction to Programed Instruction provides a clear, concise
introduction to the rather new field of auto-instruction. In six
short chapters, the entire volume (exclusive of appendix and index)
runs only to 81 pages. While brevity is often a virtue, the reviewer
feels that in this case a longer treatment would have made the book
more useful.

Chapter one provides some background on the learner and teacher
and lays the foundation for the tutor-pupil approach elaborated
later in the book. Chapter two traces a brief history of the auto-instructional
movement, with proper acknowledgement to Pressey
and Skinner. Certain principles of learning are presented in chapter
three, with emphasis on reinforcement and its place in the theoretical
framework of programing. Chapter four contrasts the Crowder and
Skinner approaches, without, unfortunately, pointing out that they
are not mutually exclusive. Some experimental results on student
reactions to programing are cited in chapter five. Chapter six concludes
with some predictions of the impact programing will have on
students, teachers, and schools.


Such material as is presented in the book is sound, but it lacks
depth. The volume would be of little help to anyone who is seriously
considering buying programs for a school, to a teacher who is trying
to program a part of his course, or to a college student who is
attempting a real mastery of the subject. While there are program
samples in the appendix, they do little more than illustrate some
points in the text. Moreover, they are limited in variety.

Books, to earn a recommendation for the serious student, must 
be able to stand the test of comparison with other works that at- 
tempt to cover similar material in the same field. Unfortunately for 
this volume, there are several books available (Stolurow’s Teaching
by Machine,¹ to mention but one) which cover the same subject in
far greater detail and in a much more thorough and scholarly fashion.

The reviewer regrets to have to come to the conclusion that An
Introduction to Programed Instruction will be of limited usefulness
to the serious scholar, although of some interest to the individual
seeking a general overview of the field of programed instruction.
Hans Сково STERN 
Los Angeles City Schools 


¹ Stolurow, L. M. Teaching by Machine. Washington, D. C.: U. S. Government
Printing Office, 1961. Pp. 173. 65¢.




VOLUME TWENTY-THREE, NUMBER TWO, SUMMER, 1963


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
Vol. XXIII, No. 2, 1963 


RELATIONSHIPS BETWEEN VARIOUS PSYCHOMETRIC
PROPERTIES OF PERSONALITY ITEMS¹


ALLEN L. EDWARDS AND JAMES A. WALSH
University of Washington


Edwards (1953) originally reported a correlation of .87 between
the probability of endorsement of a personality item and the social
desirability scale value of the item for a set of 140 personality
items. This finding has been confirmed in various other studies
(Cowen & Tongas, 1959; Edwards, 1957a, 1957b, 1959; Hanley,
1956; Hillmer, 1958; Kenny, 1956; Taylor, 1959; Wright, 1957).
Both the probability of endorsement of a personality item and the
social desirability scale value of an item may be regarded as item
characteristics or parameters. Additional item parameters may also
be specified. One of these is the dispersion or standard deviation
of the distribution of social desirability ratings assigned to an item.
Another is the probability that an item will be marked doubtful
if subjects are given the opportunity to so mark an item when they
are not sure as to whether it does or does not describe them. A
third is the conditional probability that an item will be endorsed
or marked True given that it has been marked doubtful and if
subjects are then forced to respond True or False. A fourth
characteristic is the probability that an item will be responded to
consistently if it is presented upon two separate occasions.
¹ This research was supported in part by Research Grant M-4075 from the
National Institute of Mental Health, United States Public Health Service.

The tendency to give socially desirable responses to personality
items has been discussed in a series of publications by Edwards
and his associates (Edwards, 1957b, 1961, 1962; Edwards, Heathers
& Fordyce, 1960; Edwards & Heathers, 1962; Edwards, Diers &
Walker, 1962; Edwards & Diers, 1962). A socially desirable response
is defined as a True response to an item with a socially desirable
scale value or as a False response to an item with a socially
undesirable scale value. The tendency to give socially desirable
responses is assumed to be a stable personality trait. If an item
evokes this tendency with a certain probability upon one occasion,
then it should also evoke the same tendency with a comparable
probability upon a second occasion. Since the probability of a True
response has been found to be a linear increasing monotonic
function of the social desirability scale value of an item, it follows that
the probability of a socially desirable response is greater for items
at both extremes of the social desirability continuum than it is for
items in the central section of the continuum. Therefore, we might
expect items with extreme social desirability scale values to be
responded to more consistently than items with scale values in the
central section of the social desirability continuum.
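
The step from a linear trend in P(True) to a V-shaped trend in the probability of a socially desirable response can be illustrated with a short sketch. The intercept, slope, and neutral point of 5.0 on the 9-point scale are illustrative assumptions and not values reported in the study; the function names are likewise hypothetical.

```python
def p_true(sdsv, intercept, slope):
    """Linear model of P(True) as a function of scale value, clipped to [0, 1]."""
    return min(1.0, max(0.0, intercept + slope * sdsv))


def p_socially_desirable(sdsv, intercept, slope, neutral=5.0):
    """Probability of the socially desirable response to an item.

    True is the desirable answer for items at or above the assumed neutral
    point; False is the desirable answer for items below it.
    """
    p = p_true(sdsv, intercept, slope)
    return p if sdsv >= neutral else 1.0 - p


# Hypothetical line: P(True) rises from .10 at scale value 1 to .90 at 9.
for value in (1.0, 3.0, 5.0, 7.0, 9.0):
    print(value, round(p_socially_desirable(value, 0.0, 0.1), 2))
```

Under these assumed values the probability of a socially desirable response is lowest at the neutral point and rises toward both ends of the continuum, which is the pattern the argument requires.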

Thurstone and Chave (1929) proposed that the dispersion of 
favorability ratings assigned to an attitude item might be regarded 
as an index of the item’s ambiguity. Similarly, we may regard the 
dispersion of social desirability ratings as an index of the ambiguity 
of a personality item and ambiguity should be associated with low- 
ered consistency. If this is true, then the probability of a consistent 
response should be negatively related to the dispersions of social
desirability ratings. 

If a subject is permitted to mark an item as doubtful, he may do 
so for a number of reasons. For example, he may be responding 
primarily to the content of the item and be sincerely in doubt as to 
whether it does or does not describe him, regardless of the item’s 
social desirability scale value. On the other hand, he may mark the 
item as doubtful because the socially desirable response to the item 
is not obvious to him. Both of these conditions might be expected 
to result in a lowered consistency of response. Thus, we should 
expect to find that the probability of a consistent response is nega- 
tively related to the probability that the item is marked doubtful. 

If a subject marks an item as doubtful because he is responding 
to the content of the item rather than to its social desirability scale 
value, then the probability of a doubtful response should have 
little or no relationship to the social desirability scale value of the 
item. On the other hand, if it is assumed that the nature of a socially
desirable response is least obvious for items with neutral social
desirability scale values, then the probability that an item is marked 
doubtful should be greatest for items with scale values in the central 
section of the social desirability continuum and least for items with 
extreme socially undesirable and socially desirable scale values. 

A subject may also mark an item as doubtful because he is in
conflict between making an accurate response and a socially
desirable response. For example, an item with a socially undesirable
scale value may, in fact, describe the subject and he may also be
aware that a True response is socially undesirable. Similarly, an
item with a socially desirable scale value may evoke conflict because
the item does not accurately describe the subject but he is aware
that to answer the item False is socially undesirable. Marking such
items as doubtful may provide a minor resolution of the conflict and
make the subsequent choice of an accurate but socially undesirable
response more probable. If this is the case, the correlation between
the conditional probability of a True response, given that the item
has been marked doubtful, and social desirability scale value
should be lower than the correlation between the unconditional
probability of a True response and social desirability scale value.
Furthermore, the regression line of the conditional probability of a
True response should have a higher Y intercept than the regression
line of the unconditional probability of a True response, and the
two regression lines should cross somewhere near the central section
of the social desirability continuum.
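
The prediction that the two regression lines cross can be checked by solving for their point of intersection. The sketch below assumes straight lines with hypothetical intercepts and slopes; none of the coefficients are taken from the study.

```python
def crossing_point(intercept_a, slope_a, intercept_b, slope_b):
    """Return (x, y) where two straight lines meet, or None if they are parallel."""
    if slope_a == slope_b:
        return None
    x = (intercept_b - intercept_a) / (slope_a - slope_b)
    return x, intercept_a + slope_a * x


# Hypothetical coefficients: the conditional line starts higher but is flatter.
unconditional = (0.00, 0.10)  # P(True) regressed on scale value
conditional = (0.25, 0.05)    # P(True given doubtful) regressed on scale value
x, y = crossing_point(*unconditional, *conditional)
print(round(x, 2), round(y, 2))
```

A higher intercept combined with a smaller slope is what places the conditional line above the unconditional line for undesirable items and below it for desirable items.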

Acquiescence, or the tendency to respond True, in achievement
tests has been discussed by Cronbach (1946, 1950). It seems
obvious, however, that acquiescent tendencies should not be influential
if a subject knows the correct response to an achievement item.
Instead, if acquiescence operates as a determiner of responses at all
in achievement tests, it should operate in those cases where a
subject is in doubt as to the correct response. Similar considerations
may apply to personality items, i.e., if acquiescence plays a part
in determining responses to personality items, then it should be
most likely to operate when a subject is in doubt as to whether to
respond True or False. Thus, if acquiescence tends to increase the
probability of a True response when a subject is in doubt, we should
expect to find that the conditional probability of a True response,
given that an item has been marked doubtful, is greater than the
unconditional probability of a True response, and this relationship
should hold for all points on the social desirability continuum.

The present study was undertaken to provide evidence concern- 
ing the various hypotheses stated above about the relationships 
between the probability of a consistent response, the probability of 
a doubtful response, the conditional probability of a True response 
given that an item is marked doubtful, and social desirability scale 
value. 


Method 


All subjects in the experiment were paid participants. Subjects 
were recruited as a result of notices posted in residence halls and 
on bulletin boards, announcements made in psychology and sociol- 
ogy classes, and news stories carried by the University of Wash- 
ington Daily, stating that student employees were needed for a 
Test Research Project. Each student applicant was given a bro- 
chure to read, which described in detail the nature of the task to 
be performed. In selecting students for employment, an effort was 
made to obtain a good age spread and a variety of majors in order 
to have a diverse sample. 

Two groups of students were employed, one consisting of 110 
males and the other of 111 females. Each group met on Monday 
and Wednesday or on Tuesday and Thursday for a period of three 
weeks or a total of six sessions. Males and females met in separate 
groups. The task assigned to these two groups of subjects was to 
describe themselves in terms of a large collection of personality 
statements. The subjects were instructed to read carefully each 
item and to determine whether or not they believed the item accu- 
rately described them. It was emphasized in both oral and written 
instructions that if they were in doubt as to the correct response, 
i.e., if they were not sure as to whether the item did or did not
describe them, they were to put an X between the True and False 
columns corresponding to the item on their IBM answer sheets. 
They were then instructed to give their best guess as to the correct 
answer to the item by marking it either True or False. 

The items were arranged in test booklets of 300 items each. Each
subject completed two test booklets or 600 items at each of the first
five sessions so that a total of 3,000 item responses were available
for each subject. At the sixth session the subjects were given a
battery of standard personality inventories. The testing of these two
groups of subjects was completed in the Autumn Quarter of 1961.

In the Winter Quarter of 1962, new announcements were made
concerning employment on the Test Research Project and two new
groups of students were employed, one consisting of 47 males and
the other of 48 females. The task assigned to these subjects was to
provide social desirability ratings of each of the 3,000 items on a
9-point scale, following the instructions described by Edwards
(1957b). No subject who participated in the self-descriptive study
was employed to provide social desirability ratings.

A serial sample of every twelfth item beginning with item No. 40
was selected from the items appearing in each of the eight test
booklets completed during the first four sessions. These 176 items
were repeated in the last set of 300 items responded to during the
fifth session. Measures based upon responses to the 176 items at the
time of their first appearance in the test booklets will be referred
to as measures obtained during the first testing and those based
upon the repetition of the items at the fifth session will be referred
to as those obtained during the second testing. The present study
is concerned with the relationships between various psychometric
properties of these 176 items.
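
The figure of 176 repeated items follows from the sampling rule, as the short check below shows. It assumes that items are numbered 1 through 300 within each booklet; the function name is hypothetical.

```python
def serial_sample(first_item, step, booklet_length):
    """Item numbers drawn from one booklet by taking every `step`-th item."""
    return list(range(first_item, booklet_length + 1, step))


per_booklet = serial_sample(first_item=40, step=12, booklet_length=300)
total_repeated = len(per_booklet) * 8  # eight booklets in sessions one to four
print(per_booklet[0], per_booklet[-1], len(per_booklet), total_repeated)
```

Items 40, 52, and so on through 292 give 22 items per booklet, and 22 items in each of eight booklets gives 176.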


Results and Discussion 


For each sex group separately the following item characteristics
were obtained for both the first and second testings for each of the
176 items:

1. P(T): the probability that the item was endorsed or answered
True.

2. SDSV: the social desirability scale value of the item on the
9-point rating scale.

3. Dis: the dispersion or standard deviation of the social
desirability ratings on the 9-point rating scale for each item.

Table 1 shows the test-retest reliability coefficients of the above
measures for males and females. It is clear from the data reported
in the table that the means and the standard deviations differ very
little on the two testings. Furthermore, the two psychometric
indexes, P(T) and SDSV, are highly stable for both sexes from first
to second testings.

TABLE 1

Test-Retest Reliability Coefficients of Three Item Parameters as Determined
Separately from the Responses of Males and Females to
176 Personality Items

Item Parameter         r12      X̄1       X̄2       s1       s2

1. P(T):  Males       .975     .444     .468     .298     .313
          Females     .980     .425     .446     .312     .317
2. SDSV:  Males       .972    4.865    4.899    1.587    1.534
          Females     .980    4.797    4.845    1.804    1.761
3. Dis:   Males       .559    1.137    1.119     .211     .191
          Females     .597    1.166    1.128     .221     .228

In order to determine whether the data for male and female
subjects could be combined, correlations were obtained between
SDSV based upon the male and female judgments for both the first
and second testings. For the first testing this correlation was .97
and for the second testing it was .99. Similarly, the correlation
between male P(T) and female P(T) was .96 for the first testing
and .97 for the second testing. With respect to Dis, the correlation
between male and female values on the first test was .60 and on
the second test the correlation was .52. Since all of these
correlations are very similar to the test-retest correlations for the separate
sex groups, as shown in Table 1, it seemed reasonable to combine
the data from the two sex groups to obtain pooled estimates of the
item parameters. The test-retest reliability coefficients for these new
estimates were .98 for SDSV, .98 for P(T), and .70 for Dis.
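
A coefficient of this kind can be sketched as a product-moment correlation, taken across items, between the estimates of a parameter obtained at the two testings. The text does not state the formula used, so the Pearson coefficient below is an assumption, and the five pairs of values are hypothetical.

```python
import math


def pearson_r(first, second):
    """Product-moment correlation between two equal-length lists of values."""
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    cross = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    ss_1 = sum((a - mean_1) ** 2 for a in first)
    ss_2 = sum((b - mean_2) ** 2 for b in second)
    return cross / math.sqrt(ss_1 * ss_2)


# Hypothetical P(T) estimates for five items at the first and second testing.
p_true_first = [0.12, 0.35, 0.48, 0.71, 0.90]
p_true_second = [0.15, 0.33, 0.52, 0.69, 0.93]
print(round(pearson_r(p_true_first, p_true_second), 3))
```

The same function applies to the male-female comparisons, with the two lists holding the estimates from the two sex groups rather than from the two testings.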
Three additional item characteristics were determined for each 
of the 176 items and these were also based upon the combined re- 
sponses of the two sex groups. The additional measures were: 
4. P(X): the probability that the item was marked doubtful. 
5. P(T/X): the conditional probability that an item was marked 
True, given that it was marked X. 
6. P(C): the probability of a consistent response to an item 
upon the two testings. 


The test-retest reliability coefficient of P(X) was .83 and for
P(T/X) the test-retest reliability coefficient was .51.

The measure of consistency of response used in the present study
was based upon the number of identical responses, True-True and
False-False, to an item upon the two testings. Table 2 gives the
correlations between the estimates of this parameter for the 176
items and the other item characteristics, based upon both the first
and second testings. It can be seen that the pattern of correlations
is much the same regardless of whether the item parameters are
based upon responses obtained at the first or second testing.
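
The item measures defined above can be sketched for a single item as simple proportions over subjects. The data layout, the function name, and the five hypothetical subjects are assumptions made for illustration; the study's own tabulation procedure is not described.

```python
def item_parameters(first, second):
    """Estimate P(T), P(X), P(T/X), and P(C) for one item.

    `first` and `second` hold one (answered_true, marked_doubtful) pair per
    subject, in the same subject order, for the first and second testing.
    """
    n = len(first)
    p_t = sum(answer for answer, _ in first) / n
    p_x = sum(doubtful for _, doubtful in first) / n
    answers_when_doubtful = [answer for answer, doubtful in first if doubtful]
    p_t_given_x = (
        sum(answers_when_doubtful) / len(answers_when_doubtful)
        if answers_when_doubtful
        else None  # undefined when nobody marked the item doubtful
    )
    # Consistent means True-True or False-False across the two testings.
    p_c = sum(a1 == a2 for (a1, _), (a2, _) in zip(first, second)) / n
    return {"P(T)": p_t, "P(X)": p_x, "P(T/X)": p_t_given_x, "P(C)": p_c}


# Five hypothetical subjects responding to one item on two occasions.
first = [(True, False), (True, True), (False, True), (False, False), (True, True)]
second = [(True, False), (False, False), (False, True), (False, False), (True, False)]
params = item_parameters(first, second)
print(params)
```

Here P(T), P(X), and P(T/X) are taken from the first testing only; the same function could be called with the arguments reversed to obtain the second-testing estimates.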

The negative correlation between P(X) and P(C) shows that the 
greater the probability that an item is marked doubtful, the smaller 
the probability that it will be answered consistently. Similarly, the 
greater the dispersion of the social desirability ratings of an item, 
the smaller the probability that it will be answered consistently. 
These two negative correlations are in accord with hypotheses 
stated previously. 

Of interest also is the almost zero correlation between P(C) and 
SDSV. This result is to be expected, if the hypothesis that con- 
sistency of responses is a curvilinear function of SDSV is true. 
In order to test this hypothesis, a distribution of the 176 items was 
made in terms of each half-interval on the social desirability con- 
tinuum and the mean consistency index was then obtained for the 
items falling within each half-interval. Figure 1 shows the plot of 
these means and it is obvious that the relationship between con- 
sistency and SDSV is curvilinear and in accord with the hypothesis 
that the probability of a consistent response should be greater for 
items at the two extremes of the social desirability continuum than for
items in the central section of the continuum. For the data shown 
in Figure 1, the correlation ratio is .62. 
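
A correlation ratio can detect a curvilinear relationship of this sort where the product-moment correlation is near zero. The sketch below assumes that the ratio is eta, the square root of the between-interval sum of squares divided by the total sum of squares, with items grouped into half-intervals of width 0.5. The exact interval boundaries used in the study are not given, and the six data points are hypothetical.

```python
import math
from collections import defaultdict


def correlation_ratio(scale_values, consistency, width=0.5):
    """Eta of `consistency` on `scale_values`, grouped into bins of `width`."""
    groups = defaultdict(list)
    for x, y in zip(scale_values, consistency):
        groups[math.floor(x / width)].append(y)
    grand_mean = sum(consistency) / len(consistency)
    ss_total = sum((y - grand_mean) ** 2 for y in consistency)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return math.sqrt(ss_between / ss_total)


# Hypothetical items: consistent at both extremes, less so near the middle.
sdsv = [1.1, 1.2, 5.1, 5.2, 8.6, 8.7]
p_c = [0.92, 0.88, 0.55, 0.45, 0.91, 0.89]
print(round(correlation_ratio(sdsv, p_c), 2))
```

Because eta depends only on how far the interval means depart from the grand mean, it is large for a U-shaped trend even when a straight line fitted to the same points would be nearly flat.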

TABLE 2

Correlations between Various Item Parameters Based upon Responses to 176 Items
on a First and Second Test and a Measure of Consistency of Response
to the 176 Items

                  Correlation with P(C)
Item Parameter    Test 1     Test 2        X̄1            X̄2       s1      s2

P(X)              -.557      -.046         .105          .100     .053    .052
P(T/X)             .000      [illegible]   [illegible]   .570     .180    .178

[Figure 1 plot: P(CON) on the vertical axis against social desirability scale value on the horizontal axis.]

Figure 1. Mean probabilities of a consistent response for each half-interval
on the social desirability continuum.

It was predicted that if subjects mark an item as doubtful because
they are responding to the content of the item rather than to its
social desirability scale value, then P(X) should have little or no
relationship to SDSV. On the other hand, if subjects mark an item
as doubtful because the socially desirable response is not obvious,
then it was predicted that items with neutral social desirability
scale values would have a larger value of P(X) than items at the
two extremes of the social desirability continuum. The fact that
P(X) correlates only .33 on both the first and second testings with
SDSV supports the prediction that P(X) has little linear
relationship to SDSV, but this finding is also consistent with the second
prediction if the relationship is, in fact, curvilinear. However, the
plot of the mean P(X) values for each half-interval on the social
desirability continuum, as shown in Figure 2, shows only a very
slight tendency for P(X) to increase from the socially undesirable
end of the continuum to the central section, with no obvious decline
in P(X) as the items increase in social desirability scale value. The
mean differences between the half-intervals are small and the data
cannot be considered as offering convincing evidence in favor of
the hypothesis that neutral items have a greater probability of
being marked doubtful than items with socially undesirable or
socially desirable scale values.

The hypothesis was also advanced that subjects may mark an
item as doubtful because they are in conflict between making an
accurate response which is socially undesirable and an inaccurate
response which is socially desirable. On the basis of this hypothesis,
it was predicted that if subjects were forced to respond True or
False to items they had marked doubtful, then socially undesirable
responses would be more likely to occur. As a consequence, the
correlation between the conditional probability, P(T/X), and SDSV
should be lower than the correlation between the unconditional
probability, P(T), and SDSV. Furthermore, the regression line of
P(T/X) should be higher than the regression line of P(T) for
socially undesirable items and lower for items with socially desirable
scale values. The data support this hypothesis. P(T/X) correlates
.68 with SDSV on the first testing and .62 on the second testing,
whereas the correlations between P(T) and SDSV are .92 on the
first test and .93 on the second.

The regression line of P(T/X) on SDSV and the mean values of
P(T/X) for each half-interval on the social desirability continuum
are shown in Figure 2. Figure 2 also shows the regression line of
P(T) and the mean P(T) values for each half-interval. If
acquiescence is regarded as tending to increase the probability of a True
response when subjects are in doubt, then P(T/X) should be
greater than P(T) at all points along the social desirability
continuum. A comparison of the two regression lines in Figure 2 shows
that this is not the case. Among subjects who mark an item as
doubtful, the probability of a True response is greater, i.e., P(T/X)
is greater than P(T), provided the item has a scale value in the
central or socially undesirable sections of the social desirability
continuum. On the other hand, if the item has a socially desirable
scale value, then P(T/X) is smaller than P(T). This finding is
in accord with the hypothesis that if an item is marked doubtful
and subjects are then forced to respond True or False, the
probability of a socially undesirable response is increased. In other words,
marking an item as doubtful does not increase the probability of
a True response independently of social desirability scale value.

[Figure 2 plot: probability on the vertical axis against social desirability scale value on the horizontal axis.]

Figure 2. Regression lines of P(T), P(X), and P(T/X) on social desirability
scale value. (Points plotted are the mean values of the probabilities for each
half-interval on the social desirability continuum.)


Summary 


Six psychometric properties were obtained for each of 176 per- 
sonality items: P(T), the probability of a True response; SDSV, 
the social desirability scale value; Dis, the dispersion of the social 
desirability ratings; P(X), the probability of a doubtful response; 
P(T/X), the conditional probability of a True response given that 
an item is marked doubtful; and P(C), the probability of a con- 
sistent response to an item on two occasions. 

The data support the following three hypotheses concerning con- 
sistency of response: (1) P(C) is greater for items with social 
desirability scale values at the extremes of the social desirability 
continuum than for items in the central section; (2) P(C) is nega- 
tively related to the dispersion of the social desirability ratings; 
and (3) P(C) is negatively related to P(X). 

No convincing evidence was obtained to support the hypothesis 
that P(X) is greater for items with neutral social desirability scale 
values than for items with extreme scale values. 

The data do not support the hypothesis that marking an item as
doubtful tends to increase the probability of a True response
independently of social desirability scale value. Instead, it was found
that P(T/X) was greater than P(T) for items in the central and
socially undesirable sections of the social desirability continuum
and smaller than P(T) for items with socially desirable scale values.
Marking an item as doubtful, in other words, tends to increase the
probability of a socially undesirable response.


REFERENCES 


Cowen, E. L. and Tongas, P. “The Social Desirability of Trait
Descriptive Terms: Applications to a Self-Concept Inventory.”
Journal of Consulting Psychology, XXIII (1959), 361-365.

Cronbach, L. J. “Response Sets and Test Validity.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, VI (1946), 475-494.

Cronbach, L. J. “Further Evidence on Response Sets and Test
Design.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, X
(1950), 3-31.

Edwards, A. L. “The Relationship between the Judged Desirability
of a Trait and the Probability that the Trait will be Endorsed.”
Journal of Applied Psychology, XXXVII (1953), 90-93.

Edwards, A. L. “Social Desirability and Probability of Endorsement
of Items in the Interpersonal Check List.” Journal of Abnormal
and Social Psychology, LV (1957), 394-395. (a)

Edwards, A. L. The Social Desirability Variable in Personality
Assessment and Research. New York: Dryden Press, 1957. (b)

Edwards, A. L. “Social Desirability and the Description of Others.”
Journal of Abnormal and Social Psychology, LIX (1959), 434-

Edwards, A. L. “Social Desirability or Acquiescence in the MMPI?
A Case Study with the SD Scale.” Journal of Abnormal and
Social Psychology, LXIII (1961), 351-359.

Edwards, A. L. “Social Desirability and Expected Means on MMPI
Scales.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, XXII
(1962), 71-76.

Edwards, A. L., Heathers, Louise B., and Fordyce, W. E. “Correlations
of New MMPI Scales with Edwards SD Scale.” Journal
of Clinical Psychology, XVI (1960), 26-29.

Edwards, A. L. and Heathers, Louise B. “The First Factor of the
MMPI: Social Desirability or Ego Strength?” Journal of
Consulting Psychology, XXVI (1962), 99-100.

Edwards, A. L. and Diers, Carol J. “Social Desirability and the
Factorial Interpretation of the MMPI.” EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT, XXII (1962), 501-509.

Edwards, A. L., Diers, Carol J., and Walker, J. N. “Response Sets
and Factor Loadings on 61 Personality Scales.” Journal of
Applied Psychology, XLVI (1962), 220-225.

Hanley, C. “Social Desirability and Responses to Items from Three
MMPI Scales: D, Sc, and K.” Journal of Applied Psychology,
XL (1956), 324-328.

Hillmer, M. L., Jr. “Social Desirability in a Two-Choice Personality
Scale.” Unpublished Master's thesis, University of Washington,
1958.
Kenny, D. T. “The Influence of Social Desirability on Discrepancy
Measures between Real Self and Ideal Self.” Journal of
Consulting Psychology, XX (1956), 315-318.

Taylor, J. B. “Social Desirability and MMPI Performance: The
Individual Case.” Journal of Consulting Psychology, XXIII
(1959), 514-517.

Thurstone, L. L. and Chave, E. J. The Measurement of Attitude.
Chicago: University of Chicago Press, 1929.

Wright, C. E. “Relations between Normative and Ipsative Measures
of Personality.” Unpublished Ph. D. thesis, University of
Washington, 1957.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963


FACTORED SCALES FOR MEASURING
CHARACTERISTICS OF COLLEGE
ENVIRONMENTS¹


JUM C. NUNNALLY AND DONALD L. THISTLETHWAITE
Vanderbilt University
AND
SHARON WOLFE
University of Illinois


In 1958 Stern published a preliminary manual for a college 
environments inventory, called the College Characteristics Index. 
The inventory was based upon Murray's (1938) classification of 
needs and contained thirty scales of ten dichotomous items each. 
For each hypothesized need there was a corresponding college press 
scale consisting of items describing pressures or activities in the 
college which might be satisfying to persons with the given type 
of need. Subsequently, Pace and Stern (1958) showed that mean 
press scores at five colleges, based upon undergraduate student 
descriptions of faculty and student pressures, differed markedly 
from campus to campus, and that student descriptions showed
considerable agreement with faculty descriptions.

One use of such scales for measuring college environments is to
systematically describe ways in which learning environments differ,
and to relate these environmental differences to student performance.
Knapp and Goodrich (1952) and Knapp and Greenbaum
(1953) have shown that colleges differ in the proportion of their
alumni who later attain Ph.D. degrees or who later give evidence
of becoming promising scholars. Such differences in college
“productivity” could reflect the effects of various types of learning
environments at different colleges.

¹ Grateful acknowledgement is given to Lyle H. Lanier, Provost and Executive
Vice President, University of Illinois, for his support, advice, and
participation in the study.

Thistlethwaite (1959a, 1959b) attempted to identify features of 
effective learning environments by studying the validities of items 
in the College Characteristics Index. The criterion of item validity 
was the extent to which item responses differentiated between high- 
and low-ranking colleges on a measure of adjusted productivity (a 
residual representing the difference between a college's observed 
rate of "producing" Ph.D.s among its undergraduates and the 
rate predicted on the basis of the estimated aptitude level of its 
incoming freshman classes). In a subsequent study Thistlethwaite 
(1960) grouped items with the highest validities into clusters on 
the basis of content and item-scale correlations, and added new 
items of apparently similar content to each cluster. The new inven- 
tory, called the Inventory of College Characteristics, contained 
eighteen scales, six of which were found to correlate with changes 
in aspirations for advanced degrees among college undergraduates. 
However, the validities of items in each scale varied considerably, 
and it was apparent that the scales were not independent. 
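
The adjusted productivity measure described above is a residual: the observed rate minus the rate predicted from aptitude. A minimal sketch follows, assuming a simple least-squares line fitted on a single aptitude score. The text does not specify the prediction equation, so the fitting method, the function name, and the four hypothetical colleges are illustrative only.

```python
def adjusted_productivity(aptitude, observed_rate):
    """Residuals of observed Ph.D. rate after a least-squares fit on aptitude."""
    n = len(aptitude)
    mean_x = sum(aptitude) / n
    mean_y = sum(observed_rate) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(aptitude, observed_rate))
    sxx = sum((x - mean_x) ** 2 for x in aptitude)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Positive residual: the college produces more Ph.D.s than aptitude predicts.
    return [y - (intercept + slope * x) for x, y in zip(aptitude, observed_rate)]


# Four hypothetical colleges: aptitude level and observed Ph.D. rate.
residuals = adjusted_productivity([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 9.0])
print([round(r, 2) for r in residuals])
```

Colleges with positive residuals would count as high in adjusted productivity and those with negative residuals as low, which is the contrast against which item validity was judged.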

The present study sought to redefine through factor analysis a
set of independent dimensions which might account for the
interrelationships between items in this inventory. A set of factored
scales, it was hoped, would permit us to reduce redundancy and
perhaps assist in identifying new dimensions for describing effective
learning environments.


Method 


Questionnaire. For the present study two modifications were 
made of the Inventory of College Characteristics. First, the 180 
items were split into two separate questionnaires, one containing 
the 90 items relating to faculty and the other containing the 90 
items relating to students. Second, rather than requiring a simple 
agree-disagree response for each item, each item was presented 
with a seven-step, agree-disagree rating scale. It was hoped that 
the use of a seven-step scale instead of a dichotomy would raise 
the reliability of separate items, and, consequently, of item-com- 
posites. 

Subjects. All subjects for the study were selected from the fresh- 
man and sophomore classes in the College of Liberal Arts and 


JUM C. NUNNALLY, ET AL. 241 


Sciences at the University of Illinois. The total group was divided
into four, approximately equal, randomly-selected subgroups. One
subgroup was given the “faculty” questionnaire. Another subgroup
was given the “student” questionnaire. The other two subgroups
were not used in this study.
Questionnaires were sent by mail, accompanied by a mimeographed
letter from the Dean explaining the purpose of the study
and requesting cooperation. The letter stated that the results would
be completely anonymous and would in no way adversely affect
students. The response was excellent: over 91 per cent of the
questionnaires were returned and were sufficiently complete to be used
in the analysis. This provided 551 responses to the “faculty”
questionnaire and 567 responses to the “student” questionnaire. These
N's were employed, respectively, in all of the statistics which will
be cited later.
Analysis. Separate analyses were made of the responses to the
questionnaires. Factor analyses were performed on each of the
two sets of 90 items. (All statistical work was performed on the
digital computer at the University of Illinois.) An arbitrarily large
number (twenty) of centroid factors was extracted. (Unities were
placed in the diagonal spaces of the correlation matrix preparatory
to extracting centroid factors.) All twenty were subjected to
Varimax rotation on the computer. The two sets of Varimax
factors constitute the major results.

The most highly loaded items were located for each factor.
Finally, reliability coefficients (Coefficient Alpha) were determined
for the groups of items corresponding to the factors.
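
The reliability step just described can be illustrated with a minimal Python sketch of Coefficient Alpha for one group of items. The function names, the toy ratings, and the reverse-scoring of negatively loaded items on the seven-step scale are assumptions made for illustration; this is not the routine actually run on the Illinois computer.

```python
from statistics import pvariance


def reverse_score(rating, steps=7):
    """Flip a rating on a `steps`-point scale (assumed handling for
    items that load negatively on a factor)."""
    return steps + 1 - rating


def coefficient_alpha(scores):
    """Coefficient Alpha for a respondents-by-items table of ratings.

    scores: list of rows, one per respondent, each a list of k ratings.
    """
    k = len(scores[0])
    # Variance of each item across respondents.
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    # Variance of the total (summed) score across respondents.
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings from four respondents on a three-item cluster;
# the third item is negatively keyed and is reversed before scoring.
raw = [[6, 7, 2], [5, 5, 3], [2, 3, 6], [4, 4, 4]]
keyed = [[a, b, reverse_score(c)] for a, b, c in raw]
alpha = coefficient_alpha(keyed)
```

Alpha rises with the number of items and with their average intercorrelation, which is why clusters of only a few items tend to yield modest values.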


Results: Faculty Items 


In both sets of data the tendency was for items to break up into
many small factors rather than to form a few large ones. This is
illustrated by the fact that the largest Varimax factor in the faculty
items explains less than 6 per cent of the total variance, and the
largest factor in the student items explains less than 5 per cent of
the total variance. This is not necessarily “bad” (we worried about
finding only a large general factor due to “halo error”), but it does
mean that items of these types tend to evolve into many separate
factors.
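
The percentages quoted above follow from simple arithmetic, sketched here in Python. With unities in the diagonal, each standardized item contributes one unit of variance, so the share of total variance carried by an orthogonal factor is the sum of its squared loadings divided by the number of items. The function name and the loadings in the example are hypothetical, not values from this study.

```python
def percent_variance_explained(loadings, n_items):
    """Per cent of total variance accounted for by one orthogonal factor.

    loadings: the factor's loadings on all n_items items.
    """
    return 100.0 * sum(a * a for a in loadings) / n_items


# Hypothetical factor: three salient loadings among 90 items, with the
# remaining 87 loadings taken as zero for the sake of the example.
example = [0.6, 0.5, -0.5] + [0.0] * 87
share = percent_variance_explained(example, n_items=90)

# A factor explaining less than 6 per cent of the variance of 90 items
# has a sum of squared loadings below 0.06 * 90 = 5.4.
ceiling = 0.06 * 90
```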
Partly because we found many factors with only a relatively
few items relating to each, the reliabilities (Coefficient Alpha) of
many of the factors tended to be low. Only those factors will be
presented which have reliabilities at least as high as .60.
Reliabilities for factors are presented in Table 1.


TABLE 1
Coefficient Alpha Reliabilities of Factors

Factor    Reliability      Factor    Reliability
F-I       .79              S-I       .79
F-II      [illegible]      S-II      .72
F-III     .67              S-III     .78
F-IV      .64              S-IV      .70
F-V       .64              S-V       .60
F-VI      .66              S-VI      .68


Loading       Factor F-I: Systematized Energy of Faculty

 .614         In most classes the presentation of material is well
              planned and illustrated.
 .597         Instructors are pretty practical and efficient in the
              way they dispatch their business.
-.534         Many of the instructors seem bored with their teaching
              assignments.
 .533         Assignments are usually clear and specific, making it
              easy for students to plan their studies effectively.
 .510         Instructors clearly explain the goals and purposes of
              their courses.
 .507         Faculty members put a lot of energy and enthusiasm
              into their teaching.
 .413         Most of the courses stress basic science or scholarship
              and really probe into the fundamentals of their subjects.
-.421         Some professors here tend to belittle the students.
 .410         [illegible] really get students interested in their
              subjects.


Although it is always hazardous to try to interpret factors, the
items above seem clearly (to us) to concern the extent to which
the faculty works hard at teaching and does a thorough job.


Loading       Factor F-II: Toughness of Faculty

-.731         Standards set by the professors are not particularly
              hard to achieve.
-.674         It is fairly easy to pass most courses without working
              very hard.
[illegible]   The professors really push the students’ capacities to
              the limit.
-.546         There aren’t very many courses here which demand
              excellent performance by all students.
-.476         The student can pass most courses just by learning
              what's in the textbook.


The meaning of the item content seems so clear that no elaborate 
argument need be made for the interpretation. 


Loading       Factor F-III: Availability of Faculty to Students

-.564         Faculty members are available to students only during
              scheduled office hours.
-.547         The professors seem to have little time for conversation
              with students.
 .547         Professors frequently go out of their way to establish
              friendly relations with students.
 .542         There are many facilities and opportunities for
              individual creative activity.
-.497         The campus atmosphere does not seem to be very
              stimulating for faculty members.
-.493         There is little opportunity here for pursuing independent
              study under the supervision of faculty members.


The meaning of this factor also seems clear. 


Loading Factor F-IV: Interestingness of Lectures 


—.646 


—.561 


313 


Lectures are frequently routine and duplicate material
in the text.

Personality, pull, and bluff get students through many
courses.

Many lectures are delivered in a monotone with little
inflection or emphasis.

Professors typically exhibit great interest and enthusiasm
in their subjects.

The professors really talk with the students, not just
at them.


All of the items clearly fit the interpretation with the exception
of the second one listed, “Personality, pull, and bluff, etc.” The
loading of the item is sufficiently high to strongly suggest that the
item actually belongs with the factor, but “why” it belongs we are
not sure.




Loading       Factor F-V: Faculty Interest in Arts and Humanities

-.698         Very few of the professors here try to get students
              interested in the humanities.
 .614         The faculty encourages the student to take courses
              in the social sciences and humanities.
-.572         Instructors have very little interest in drama or the
              arts.
-.490         Advisors seem unaware that a well-rounded program
              of study includes courses in the arts and humanities.
 .469         Student interest in understanding and criticizing
              important works in art, music, and drama is encouraged
              by the faculty.


There seems no need to argue the interpretation of this factor. 


Loading       Factor F-VI: Vocational Emphasis

-.620         Very few of the courses here will be useful to students
              who go into business or industry.
-.610         The academic atmosphere is not very helpful to the
              student who wants to specialize in business, engineering,
              management or other practical affairs.
 .548         The university offers many really practical courses
              designed to prepare the student for his occupation.
-.402         It is difficult to take clear notes in most courses.
-.280         Very few instructors try to give the student the
              practical training he will need in his career field.


This is also a very clear-cut factor. 


Results: Student Items 


To refresh the reader’s memory, twenty Varimax factors were 
obtained for the “student” items. Following are the factors that 
have reliabilities of at least .60. 


Loading       Factor S-I: Intellectual Drive of Students

 .669         Long, serious philosophical discussions are common
              among the students.
 .686         Books dealing with psychological problems of personal
              values are widely read and discussed.
 .554         Most students here have strong intellectual commitments.
 .509         There is a lot of interest in the philosophy and methods
              of science.
 .501         There is a lot of interest here in poetry, music,
              painting, sculpture, architecture, etc.
-.470         When students get together they seldom talk about
              trends in art, music, or the theater.
 .456         Students spend a lot of time planning their intellectual
              careers.
 .421         There are lots of informal social groups in which
              people really enjoy listening to witty conversation.
 .411         A controversial speaker always stirs up a lot of student
              discussion.
 .411         Concerts and art exhibits always draw big crowds of
              students.


All of the items apparently concern the extent to which students
voluntarily join in “intellectual” discussions and pursuits out of
class.


Loading       Factor S-II: Personal Appearance and Manners

-.749         Very few students are concerned about being properly
              groomed.
-.729         Students here don't seem to care about their personal
              appearance.
 .668         Students think about dressing appropriately and
              interestingly for different occasions (classes, social
              events, sports, and other affairs).
 .810         Proper social forms and manners are important here.


The factor seems to be almost exclusively concerned with personal 
appearance and the social amenities. If this factor proves useful in 
future research to differentiate between colleges, or between schools 
within them, it may relate to the pressures exerted by some student 
organizations with respect to these functions. 


Loading       Factor S-III: Competition

 .458         The competition for high achievement is intense.
 .697         The competition for special honors is very tough.
 .645         The high calibre of students here puts a lot of pressure
              on one.
 .522         There is a great deal of rivalry for grades among
              students.
-.521         There is really very little in the way of student
              competition to contend with.
-.422         It is relatively easy to win scholarship awards here.


The interpretation is quite obvious. 




Loading       Factor S-IV: Science Interest

-.669         When students get together they seldom talk about
              science.
-.629         Very few students here prefer to talk about science
              or mathematics as opposed to poetry or politics.
 .622         Science and math are the dominant interests of most
              of my friends here.
-.587         Few students are planning careers in science.
 .454         There is a lot of interest in the philosophy and methods
              of science.
 .428         Science is the most appealing way of life for a majority
              of the students here.


The interpretation here also is obvious. 


Loading       Factor S-V: Pressure Against Scholarly Activities

 .633         A student who insists on analyzing and classifying art
              and music is likely to be regarded as a little odd.
 .605         A student whose interests are confined to only a single
              area is likely to be regarded as a little odd.
 .519         A student who spends most of his time in a science
              laboratory is likely to be regarded as a little odd.
 .812         Students who are concerned with developing their own
              personal and private system of values are likely to be
              regarded as odd.
 .805         Most students have very little interest in round tables,
              panel meetings, or other formal discussions.


The reader may be relieved to find that at least one of our factors
is not sparklingly clear. What apparently underlies all of the items
is a pressure from students as a whole to isolate students who are
intensely committed to academic and intellectual pursuits of any
kind. The factor does not relate necessarily to lack of social
participation. If it did, we would have found some (of the many) items in
the questionnaire relating to social participation with negative
loadings on the factor.

Loading       Factor S-VI: Interest in Visiting Speakers

-.643         A lecture by an outstanding literary critic would be
              poorly attended.
 .603         There would be a capacity audience for a lecture by
              an outstanding philosopher or theologian.
-.565         A lecture by an outstanding scientist would be poorly
              attended.
 .501         Students respond enthusiastically to a colorful and
              dramatic speaker.
 .317         A controversial speaker always stirs up a lot of student
              discussion.


All of the items apparently concern attendance at and reaction to 
out-of-class speeches and lectures. The content appears to be so 
narrowly circumscribed that it is doubtful that the factor will prove 
to be a highly important yardstick in future research.

Discussion 
The net gain from our studies is one-dozen factors relating to
student perceptions of college environments. The factors are offered 
to researchers for use in future studies of college environments. 
Several suggestions will be given about how to use the scales and 
about how the scales could be improved in future research. 

Some of the factor reliabilities (see Table 1) are not as high as 
one would like. The reliabilities are sufficiently high to make
comparisons among large groups, e.g., large segments of the student
body in different colleges. If need be, the reliabilities of the factors 
could be raised by adding items. Most of the factors are so clear in 
meaning that it should be relatively easy to construct more items 
relating to the respective factors. If the number of items for each 
factor was increased to about twelve, this should raise all of the 
reliabilities to quite respectable figures. 
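
The expected gain from lengthening a scale can be roughed out with the Spearman-Brown formula, sketched below in Python. Applying that formula here is our assumption (the text does not name it), and it presumes the added items are comparable in quality to the existing ones. The function name is illustrative; the starting figures, a reliability of .60 for the five-item factor S-V, are taken from Table 1 and the item listing above.

```python
def spearman_brown(reliability, n_items_old, n_items_new):
    """Projected reliability when a scale is lengthened (or shortened),
    assuming the new items behave like the old ones."""
    k = n_items_new / n_items_old  # lengthening factor
    return k * reliability / (1 + (k - 1) * reliability)


# Factor S-V: five items with a reliability of .60, lengthened to twelve.
projected = spearman_brown(0.60, n_items_old=5, n_items_new=12)
print(round(projected, 2))  # prints 0.78
```

On this assumption even the least reliable of the twelve factors would approach .80 at twelve items.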

From this analysis many of the 180 items appear to concern 
rather specific issues which do not strongly relate to more general 
perceptions of the faculty and student press. This result could be 
partly due to the fact that we have sampled students at only one 
university. By sampling observers at different colleges and universi- 
ties we might well find that students at different colleges have rela- 
tively consistent and general images of teachers and fellow students, 
and that these images differ markedly from campus to campus. Thus 
intercorrelations between items would be greater if inter-university 
differences in college press are relatively large compared with inter- 
individual differences in perceptions of the press at individual col- 
leges. If this is the case, the reliabilities of the factored scales would 
be greater than we have estimated from this analysis. 

Of course, there must be other factors relating to student
perceptions of college environments, ones that we did not find in our
study. Hypotheses about new factors could be tested by (a) the
construction of related items, (b) the administration of these along
with items relating to some or all of our dozen factors, and (c)
factor analysis of the results to see if such additional factors
actually exist.

In future studies it is recommended that the seven-step rating 
scale be retained rather than reverting to the dichotomous responses 
obtained in previous studies. Otherwise, the reliabilities surely 
would drop markedly. 


Summary 


The article reports a factor-analytic study of students’ perceptions
of college environments. Employed in the study were the 180
items of the Inventory of College Characteristics. The items were 
divided into two groups of ninety each, one group containing those 
items relating to students’ perceptions of fellow students and the 
other group relating to perceptions of faculty. Each group of items 
was administered to over 500 freshman and sophomore students at 
the University of Illinois. Separate factor analyses were made of 
the two groups of items, resulting in six major factors for each. 
Efforts were made to interpret the factors and to make recommen- 
dations for their use in future studies of college environments. 


REFERENCES 


Knapp, R. H. and Goodrich, H. B. Origins of American Scientists.
Chicago: University of Chicago Press, 1952.

Knapp, R. H. and Greenbaum, J. J. The Younger American Scholar:
His Collegiate Origins. Chicago: University of Chicago Press,
[year illegible].

Murray, H. A. Explorations in Personality. New York: Oxford
University Press, 1938.

Pace, C. R. and Stern, G. G. “An Approach to the Measurement
of Psychological Characteristics of College Environments.”
Journal of Educational Psychology, XLIX (1958), 269-277.

Stern, G. G. College Characteristics Index: Preliminary Manual.
[publisher illegible], 1958.

Thistlethwaite, D. L. “College Environments and the Development
of Talent.” Science, CXXX (1959), 71-76. (a)

Thistlethwaite, D. L. “College Press and Student Achievement.”
Journal of Educational Psychology, L (1959), 183-191. (b)

Thistlethwaite, D. L. “College Press and Changes in Study Plans
of Talented Students.” Journal of Educational Psychology, LI
(1960).


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 




THE EXPERIMENTAL CONDITIONS FOR MEASURING 
INDIVIDUAL DIFFERENCES¹


DONALD W. FISKE AND JOHN M. BUTLER
University of Chicago 


Several years ago the authors considered briefly some of the
problems involved in the measurement of ability, in contrast to
those involved in the measurement of personality (Butler & Fiske,
1955). In the intervening period we have become even more
convinced of the necessity for reconsideration and re-examination of
the basic conditions under which individual differences in these
domains are assessed. Since much of the technical work in this part
of psychological measurement has slighted or neglected these
conditions, their restatement seems required. While many of the points
to be made have been noted before (e.g., Anastasi, 1948; Bergman
& Spence, 1944; Campbell, 1957; Coombs, 1956, 1960; Goodenough,
1949; Levinson, 1946; McQuitty, 1942; Sarason, 1954), they seldom,
if ever, have been considered concurrently.

Measurement has been informally described by Stevens (1959,
p. 18) as “the business of pinning numbers on things.” Even the
more formal definitions, such as N. R. Campbell’s classic
formulation (1928) in terms of the assignment of numerals in accordance
with certain rules, emphasize the rational process rather than the
empirical operations. Standard expositions of classical test theory
(e.g., Gulliksen, 1950) usually begin with a set of responses rather
than with a subject in a test situation. The systematic integration
of psychological measurement provided by Coombs’ “Theory of
Data” (1960) goes back only to the formal analysis of the
subject’s task as indicated by the instructions. There remains the prior
matter of the conditions under which the observations are to be
made. A reconsideration of these conditions brings out the close
interdependence, if not the fundamental unity, present in “The Two
Scientific Disciplines of Psychology” (Cronbach, 1957). In the
ideal case, each discipline ascertains the contribution of one variable
to the observed variance in another variable, with other variables held
constant, randomized, or otherwise treated so that their influence
is minimized or at least can be evaluated. In correlational
psychology, the experimenter determines the association between one
characteristic of the subjects or one aspect of their behavior and their
responses to a set of stimuli and, in principle, makes the stimuli and
the testing conditions constant not only objectively but also
subjectively across subjects. Thus we are arguing that psychometrics,
no less than experimental psychology, requires careful specification
and control of the empirical conditions.

1 For their helpful criticisms of earlier drafts of this paper, we are indebted
to Harold P. Bechtoldt, Benjamin S. Bloom, Donald T. Campbell, Desmond S.
Cartwright, Lee J. Cronbach, Charles Dicken, Lyle V. Jones, Salvatore Maddi,
Hobart Osburn, and Jack Sawyer.

A complete discussion of the measurement of individual differ- 
ences would include not only an analysis of the conditions under 
which the subject makes his responses but also a consideration of 
the prior procedures for selecting appropriate stimuli and of the 
methods for subsequently deriving scores from these responses. For 
the sake of both emphasis and brevity, this paper will focus almost 
entirely on the situation in which the subject interacts with the 
stimuli or items. 


The Conditions for Measuring Ability 


The usual instructions for a test of ability can be paraphrased 
as follows: 


Here is a test of your ability. Give the right answer to each 
question (problem, or item). Try to do your best, but do not 
expect to be able to answer correctly all of the questions. 


What does the examiner or experimenter imply by these instruc- 
tions to the subject? “Here is a test. You know what a test is. You 
have taken many before. You know you are to work quietly by 
yourself, and not help others or get help from others.” These impli- 
cations are familiar to children and adults in our American culture, 
but people in many other countries have not been so indoctrinated. 


The instruction to “give the right answer” (or “the best answer”)
needs no elaboration for individuals subjected to standard
educational experiences. They accept the notion that, in contexts such as
this, each problem has a right answer, an answer on which experts
would show a consensus. The important point is that the correctness
of the response is determined by operations which are independent
of both subject and experimenter. The latter cannot capriciously
identify a response as right by fiat.

In the typical test of ability, the nature of the problem is
immediately evident to the subject. In tests of arithmetical
operations, both the content and the function are familiar. In analogy
and vocabulary tests, the content is familiar, at least up to a point,
and the subject is acquainted with or can easily grasp the
function he is to perform. Ability is typically estimated from
performance at a crude limit of learning, as Ferguson (1954) has noted.
The function, the content, or both have been overlearned in past
experience.

The next part of the instructions, “Try to do your best,” hardly 
needs to be said. All but an occasional deviant subject will try to 
give as many right responses as possible. The examiner’s goal is to 
produce optimal motivation in each subject. Even before the for- 
mulation and description of the inverted-U curve for performance 
as a function of activation or arousal (cf. Hebb, 1955; Duffy, 1957; 
and Malmo, 1959), psychological examiners realized that subjects 
should be motivated, but not so extremely that anxiety and other 
emotional states would disrupt performance. Fortunately, there 
appears to be a broad range of motivation levels within which maxi- 
mum performance occurs. Therefore the performance of any one 
subject on an ability test can ordinarily be compared with that of 
any other subject, with the difference being attributable to the only 
factor which is allowed to vary, the abilities of the subjects. 

The final part of the paraphrased instructions, “. . . but do not
expect to be able to answer correctly all of the questions,” is
included primarily to reassure the subject at the point where he
encounters items that are very difficult or impossible for him to get
right. Implicit in this section, however, is the notion of difficulty,
a term operationally defined by the proportion of subjects failing
each item. Difficulty provides a basis for ordering items so that
the typical subject will be able to make correct responses to the
first items in the test, but will come to a point beyond which he
makes no, or almost no, correct responses. While difficulty is
measured in terms of the proportion of a group who fail each item, an
ordering of items on the basis of group difficulty usually estimates
closely the ordering of items for each subject in terms of the
probability of his making the correct response.
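
The operational definition of difficulty given above can be sketched in a few lines of Python. The function names and the small table of right/wrong responses are hypothetical, offered only to make the arithmetic concrete.

```python
def item_difficulty(responses):
    """Proportion of subjects failing each item.

    responses: list of rows, one per subject, each a list of
    1 (correct) or 0 (incorrect) for every item.
    """
    n_subjects = len(responses)
    n_items = len(responses[0])
    return [
        sum(1 - row[j] for row in responses) / n_subjects
        for j in range(n_items)
    ]


def order_by_difficulty(responses):
    """Item indices arranged from easiest to hardest for the group."""
    difficulty = item_difficulty(responses)
    return sorted(range(len(difficulty)), key=lambda j: difficulty[j])


# Hypothetical responses of four subjects to three items.
answers = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
]
difficulties = item_difficulty(answers)  # [0.0, 0.75, 0.25]
ordering = order_by_difficulty(answers)  # [0, 2, 1]
```

Items would then be presented in the order given by `ordering`, so that the typical subject meets the items he can answer before those he cannot.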

The crucial aspect of the concept of difficulty, for the present 
discussion, is the fact that it refers to a limit. If 40 per cent of a 
group fail an item, it is assumed quite correctly that no matter how 
hard the group tries, the proportion failing the item will not be 
reduced. Similarly, when a subject makes incorrect responses to 
25 per cent of the items on a test, it is assumed correctly that this 
is his limit, that he is and will be unable to give correct answers 
to these items, except as a consequence of practice, further experi- 
ence, or chance. 

Even for a highly speeded test in which the items are very easy 
and essentially of zero difficulty, the same interpretation holds. On 
a Digit-Symbol Test, ability is measured in terms of the speed with 
which the subject responds. His limit is the number of items to 
which he can respond correctly in the time allowed. Here also it 
is assumed that he cannot better his performance except perhaps 
as a result of such extraneous factors as practice on this particular 
type of item. (Of course a subject should be told whether he should 
emphasize speed or power: the same test materials may not measure 
the same behavior under speeded and under power conditions. Cf. 
Mollenkopf, 1960.) 

A fundamental aspect of the conditions under which ability is 
estimated is the nature of the relationship between the experimenter 
and the subject. The experimenter does what he can to make the 
setting seem most favorable for maximal performance. As indicated 
earlier, he tries to avoid arousing anxiety in the subject. He also 
attempts to eliminate distractions even though these have typically 
been found to have little or no objective effect on performance. 

It is obvious that the examiner is in charge of the situation. A
prerequisite condition for measuring ability is the subject’s
acceptance of this fact and his agreement to carry out the task set for him
by the examiner, i.e., his acceptance of a role. He understands that
the examiner is measuring his ability. But more important is his
implicit understanding of the fact that the examiner wants him to
do as well as he can, and has arranged the conditions to make this
possible. Both are, in effect, working together so that the subject
can reach his limit of performance. In spite of their different roles,
the purposes of the two people are consonant and harmonious.

These conditions yield comparable measures of the maximum 
performance of each subject. We choose to assess performance in 
terms of its maximum for two reasons: First, we want a pure meas- 
ure, one that is determined almost wholly by one thing, the subject’s 
capacity, rather than a measure which is affected by several
influences. Second, we measure maximum performance because it is
probably more stable than performance under more lifelike condi- 
tions. In everyday life, performance is often impaired by insufficient 
or conflicted motivation. It seems obvious that the ebb and flow 
of the strengths of motives produces a variable internal field, 
within the subject, which contributes to fluctuation in performance. 

In the ideal case, all the conditions of the experiment have been 
constant for all subjects, and hence the inherent ability of the 
several subjects is the only factor which can account for the differ- 
ences in the resulting scores. Such an ideal situation is essentially 
that of the classical experimental design in which only one inde- 
pendent variable is allowed to vary and hence the systematic 
variance in the dependent variable can be attributed to it. 


Operations for Measuring Personality 


A moment’s reflection makes it immediately apparent that the
experimental operations for measuring personality are different
from those for measuring ability. We ordinarily do not say “Here
is a test of your personality” because the notion of a personality
test does not have clear-cut common connotations for subjects, even
in the test-wise American culture. (If anything, it would connote
a procedure to detect deviancy or pathology, and consequently
would arouse anxiety.) We do not say “Give the right answer” but
often we say “There are no right or wrong answers,” meaning
answers which are right for every subject. We do not say “Try to do
your best” because the subject is not working toward a limit. While
the subject's responses are a product of his past experience, he is
not exercising a well-practiced function. The examiner does not
try to motivate the subject to maximize the extent to which his
responses agree with some external criterion. The motive or goal
which the examiner seeks to produce in the subject is not a common
one which the subject has frequently experienced. The concept of
an inverted-U relationship between quality of performance and
strength of motivation is inapplicable. The notion of difficulty is
not relevant: the proportion of subjects who “fail” an item (i.e.,
give a response classified in a particular way by the examiner)
can ordinarily be altered readily by changing the instructions or
other conditions, and this proportion can be decreased more easily
than can difficulty values for ability items.

Finally, there is the crucial fact that the purposes of the exam- 
iner and subject are no longer harmonious. The examiner does not 
tell the subject the variable being measured, to avoid the possible 
effects of this knowledge upon the subject’s responses. Although the 
subject ordinarily cooperates to the extent of making some re- 
sponses, the experimenter rarely has evidence to justify the infer- 
ence that the criteria used by the subject in selecting his response 
are solely those desired by the experimenter. The subject is not 
familiar with the role which the experimenter desires him to accept. 

It is, of course, true that the variables of personality are more 
or less practiced ones; certainly some neurotic tendencies are over- 
learned. However, many personality tests do not elicit neurotic 
tendencies in the experimental situation but rather ask a subject 
to report his actions or feelings. Such reports are ordinarily not 
highly practiced. Even in those testing situations in which it might 
be argued that a subject is manifesting such a tendency in his re- 
sponses, the conditions are not familiar ones and the several sub- 
jects are not all utilizing the same practiced function. 

These differences between the conditions for measuring ability 
and achievement and those for measuring personality help to ex- 
plain three empirical generalizations: As compared to ability scores, 
personality scores typically are less stable over time; they have 
lower internal consistency; and they have lower correlations with 
other behavior when it is assessed by measures which are experi- 
mentally independent. 

Although ability measurement seems definitely ahead in this 
comparison, it has some of the same problems to a lesser degree. 
Ward Edwards (1961) has recently analyzed the instructions for 
psychological experiments and tests in terms of decision theory. 
He points out that, even in ability tests, the instructions typically
are ambiguous or internally contradictory, and a subject is not
able to determine an optimal strategy for maximizing his value
measure. This condition is, of course, much worse in personality
testing.

In the broad general field of personality, the conditions most 
closely approximating those for measuring ability are the condi- 
tions for certain tests of temperament. An example is a color-form 
test in which the subject is shown a motion-picture film constructed 
so that the apparent movement is in one direction when determined 
by the form of the stimuli and in the opposite direction when 
determined by the color (cf. the test developed by E. H. Hess 
and used by Sinclair, 1956). The subject is simply asked to report 
the direction of the apparent movement, and has no reason to do 
otherwise. He is, of course, unable to tell whether his perception 
follows the color or the form. Even a test of this type is not entirely 
satisfactory, because some subjects show tendencies to respond pri- 
marily in particular directions, such as downward, regardless of the 
color and form (cf. Thurstone, 1953). It remains to be established
whether such tendencies emerge only in subjects whose tendencies 
to follow color and form are of approximately equal strength, or 
whether these directional tendencies are stronger than and block out 
the tendencies to follow color or form. 

Many such “objective” tests have been devised (cf. Thurstone, 
1953), but few if any have been adequately developed and studied 
(cf. Krathwohl & Cronbach, 1956). The most promising types would 
seem to be those utilizing operations similar to those used in meas-
uring ability. The analysis of the kinds of errors a subject makes
on ability tests has long been recognized as a possible source of
information about personality. Another approach determines the 
decrement in performance associated with a change in the form 
of item (e.g., the Stroop color-word test). A third involves what
appears to be an ability test, but more than one answer is correct 
and the subject’s score is based on the type of correct answer he 
gives. While such tests seem valuable, it is unlikely that all impor- 
tant dimensions of personality can be elicited and assessed in such 
restricted contexts. 

In the last decade, considerable effort has been devoted to the 
approach in which the examiner provides stimuli which evoke a
motive or need, the strength of which is determined for each
subject from the character of his responses. Most work has been 
done on the achievement motive, an appropriate choice for the 
American culture (cf. McClelland, Atkinson, Clark & Lowell, 1953). 
While reasonably good agreement between scores has been obtained, 
the internal consistency is clearly lower than that for the typical 
test of ability, and the generality of the measures across different 
sets of stimulus-pictures has frequently been disappointingly low.
It is not surprising that the highest reliabilities have been found 
for more structured pictures, for those clearly depicting scenes in- 
volving achievement (cf. Atkinson, 1960, p. 273). 

Probably more subjects have taken questionnaires or inventories 
than any other type of personality test. These are still widely used 
although some of their potential defects have been known for a
quarter of a century. While these procedures occur in various forms, 
the most common one asks the subject to select, from the two or 
more alternatives provided, that response which best describes him, 
his actions, his feelings, or his beliefs. The task presented to the 
subject is a rather unfamiliar one. Certainly he has no external
criterion, such as the notion of a “right” answer on which experts 
would agree, to guide him. Even more crucial is the fact that the 
examiner cannot capitalize on an essentially universal motive: he 
cannot establish conditions which evoke a common motivational 
state in all subjects, a state characterized by a single quality, a
single goal, a state uncomplicated by the presence of other motives 
involving divergent or incompatible end-states. 

In the last dozen years, work on response sets (Cronbach, 1946, 
1950) has revealed the operation of specific dispositions which un- 
fortunately vary in strength from subject to subject. This work 
makes it quite clear that the same experimental operations produce 
different patterns of motives in different subjects. The same task 
is interpreted in different ways. Hence the scores of the several 
subjects are not comparable. The more unstructured the instructions, 
the more response sets operate. 

Finally, there is the diverse class of projective techniques. These 
are typically the least structured of all “tests”: although a task
is set for the subject, it provides no criteria by which he can choose 
his responses from among the very large number of possibilities. 
The scores or measures derived from these techniques are more 
ipsative than normative: they are not strictly comparable across
subjects. If a subject makes many responses of one type, he is more 
or less limited in the number of responses that he can make of other 
types. While this experimental dependence between scores is recog-
nized and taken into account in clinical applications of these
instruments, it seriously limits the possibility of measuring single 
variables in a fruitful way. 
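
The experimental dependence described here can be made concrete with a small numerical sketch. The figures below are hypothetical, not data from any instrument: each of four imagined subjects gives a fixed total of ten responses, every response being scored as one of two types, so that a high count on one type forces a low count on the other. The names `pearson`, `TOTAL_RESPONSES`, `type_a`, and `type_b` are introduced only for this illustration.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical protocol length: every subject gives the same number of responses.
TOTAL_RESPONSES = 10

# Hypothetical counts of type-A responses for four subjects.
type_a = [2, 5, 7, 9]

# Whatever is not scored as type A must be scored as type B.
type_b = [TOTAL_RESPONSES - a for a in type_a]

# The two scores are tied together by the scoring scheme itself.
r_ab = pearson(type_a, type_b)
print(r_ab)
```

With only two categories and a fixed total, the two counts are perfectly negatively correlated. With more categories the constraint is looser, but the counts still cannot vary independently of one another.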

Since projective tasks can be interpreted by subjects in different 
ways (cf. Schachtel, 1945), there is a possibility of identifying
groups of subjects on the basis of their interpretation. Clusters or 
types can also be determined from common response patterns in 
other kinds of instruments. (Cf. McQuitty, 1961; Sawyer & Nosan-
chuk, 1960; and the approach developed by Butler: Butler, 1953;
Butler, Wagstaff & Rice, 1960.) While this approach has been 
shown to be fruitful, given the present state of methodological de- 
velopment, it is cumbersome and its classifications are frequently 
empirical and ad hoc. As it becomes possible to do so, these classi- 
fications should eventually be superseded by operations locating 
individuals along each of several continua associated with general 
constructs. 

This brief survey has been by no means exhaustive. It has omitted 
several types of instruments, especially interest tests. These yield
highly reliable and amazingly stable scores of considerable predic-
tive value. Insofar as different patterns of interests reflect different
motivations, interest tests might seem to deny our argument. They
point up, however, the crucial distinction between motivation in the
measurement situation and individual patterns of motives: subjects
taking interest tests are ordinarily asked to indicate their prefer-
ences, the content being such that they are generally willing to
follow this instruction. Thus, whatever the diversity of their inter-
ests and motives, the subjects are homogeneous with respect to the
motive of indicating their preferences.


The Two Central Problems in Measuring Individual Differences 


The preceding exposition has developed the thesis that, to make 
possible the comparison of scores obtained by different individuals,
the experimental conditions must be constant so that the inherent 
attribute (ability or behavioral disposition) of the subject is the 
only factor allowed to vary. The examiner has the central problem 
of ensuring that the conditions produce a common orientation to-
ward the task set. In other words, there must be a highly “structured
test situation” (Secord, 1953), one in which the person-to-person
variability of the definition of the test situation is reduced to a
minimum. The motivation of the several subjects must be of the
same quality: ordinarily a single motive should predominate and
should be of approximately equal strength in all subjects. This con-
dition ordinarily occurs when the task involves a function with
which each subject is familiar, or at least a function which is read-
ily grasped on the basis of previous common experience.

The other central problem is that of the generalizability of the
scores obtained by experimental operations established to measure 
psychological variables. Narrowly viewed, it is the problem of the 
extent to which scores based on one set of stimulus-items can be 
reproduced with another set of items. The evidence is overwhelming 
that reproducibility is typically higher in tests of ability than in 
tests of personality. A major factor is the difference between the 
formulations of the two types of variables. An ability is ordinarily 
construed as the best performance on the specified function of which 
the subject is capable, i.e., under most favorable conditions. A per- 
sonality trait is a tendency to respond in a given way. We are 
ordinarily concerned with the typical (modal or mean?) strength
of this tendency because this provides the best estimate of what a
person is most likely to do. We are rarely concerned with the maxi-
mum extent to which a person can be talkative or hostile! Thus
the measurement problem for ability is considerably easier than
that for personality. The former seeks to determine a limit of ca-
pacity, a maximum; the latter seeks to estimate some representative
value of the strength of a tendency, even though it is clear that
there is great variation in such strength over time and over con-
ditions.

More broadly viewed, there is the problem of the generalizabil-
ity of scores to situations which differ from the experimental condi-
tions utilized to obtain the test score. Strictly speaking, this matter
is outside the province of measurement itself. It is, however, funda-
mental to the scientific usefulness of the particular measurement
operations.

Perhaps we expect too much of personality measures. Just as
a measure of ability is presumed to indicate what an individual
can do but not necessarily what he will do, so a measure of a trait
might best be taken as an estimate of what a person is most likely 
to do or seek in a narrowly delimited situation, rather than as a 
characteristic of a wide range of his behaviors. 

The problem we are considering can be viewed as the represen- 
tativeness of the testing situation, to borrow Brunswik's term 
(Brunswik, 1947): To what extent is this set of experimental con- 
ditions representative of, or typical of, the conditions to which 
we wish to generalize? We believe that generalization is possible 
only to situations which have for the subject the same meaning 
as the testing situation. Thus an ability score, or more exactly, a 
person’s relative position in a group, can be used to predict rela- 
tive performance on everyday tasks requiring the given ability (or 
perhaps very similar abilities), provided that the latter tasks evoke 
motivation similar in kind and extent to that produced by the test- 
ing situation. Similarly, tests of ways of perceiving things or ways 
of doing things can provide a basis for estimating reactions in 
situations of similar importance or meaning for the subject. In 
contrast, the extent to which measures of personality can be gen- 
eralized seems quite limited. There are few situations closely re- 
sembling the minimally structured test situation in which a disin- 
terested examiner asks the subject to react to unfamiliar objects or 
to attempt to describe himself. Moreover, for a single subject, there 
is little consistency of behavior over time in a given unstructured 
situation, the consistency becoming greater as the particular situa- 
tion has more structure and constrains behavior to a greater degree 
(cf. Fiske, 1961, pp. 326-354). 

Discussion 

In measuring personality, one major alternative strategy is to 
adapt still further the classical model of psychometric theory for
assessing ability or aptitude. In essence, this approach involves
attempting to make the conditions constant for all subjects so that
the only source of variance is differences between subjects in the
personality characteristic being assessed. The rationale for this
orientation might be based upon Coombs’ argument that the data
of mental tests and neurotic inventories have the same formal char-
acter and the same measurement model, even though the point
corresponding to the difficulty of a mental test item is more stable
over individuals than that corresponding to a questionnaire item
(Coombs, 1960, pp. 148-149). Such a point of view might incor- 
porate Cattell’s position that any behavior has three aspects: abil- 
ity, temperament, and dynamics (Cattell, 1946, Chapter 7). Thus 
in measuring an ability trait, one tries to minimize the contribution 
of temperament and dynamics. Similarly, in measuring a person- 
ality variable, one would minimize effects from abilities and tem- 
perament. Although such an effort may be largely successful, it does 
not meet the problem of eliminating contributions from personality 
traits other than the one in which the experimenter is primarily 
interested. 

In his book on Personality (1959), Guilford frankly states the 
matter of multiple determination of behavior. He urges the meas- 
urer to view a test as an experiment in which the effects of the 
situation and of organic conditions are controlled or standardized 
so as to maximize the contribution of traits to test scores. While 
he recognizes many of the limitations of tests that have been men- 
tioned or implied earlier in this paper, he seems to us to give insuf- 
ficient attention to the problem of the noncomparability of the 
scores for different subjects as a function of unavoidable differences 
in the contribution of irrelevant determinants. Similarly, while he 
indicates that the factorial composition of a test may be complex, he 
still seems to view the factorial analysis of data from a group of 
subjects as an adequate approximation to the relative contribution 
of the several factors to the score of each individual, again neglect-
ing the probable idiosyncrasy of the pattern for each specific sub-
ject.

Thus one strategy which is open to the experimenter attempting 
to measure a specific personality variable is to try to make the 
conditions strictly comparable for all subjects so that the actual 
differences among subjects on that variable will account for the 
great majority of the variance in the test scores. For this purpose,
we must develop and empirically study new types of instructions 
and new kinds of item material. In recent years, much research has 
attempted to achieve the same goals by separately measuring and 
partialing out the contributions of extraneous variables, but we see 
little hope of exceeding the very modest level of success which has 
been obtained up to this time. One important branch of this attack 
has been the development of such correction scores as K on the 
MMPI. While the separate measurement of dispositions irrelevant
to the trait being assessed certainly improves the situation, we still 
do not have assurance that corrected scores are strictly comparable, 
that they reflect differences in only the one trait under consideration. 

One variant on this major strategy deserves more investigation 
than it has previously received. Just as we measure ability in terms 
of the subject's maximum potential for carrying out a particular
intellectual function, rather than in terms of the level of his per- 
formance under everyday conditions of incentive and possibly con- 
flicting motivations, so we might attempt to determine the maximum 
strength of a personality disposition under the most favorable con- 
ditions, that is, under conditions which elicit or arouse this disposi- 
tion to the greatest degree. Practical, cultural, and ethical consid- 
erations prevent the use of this approach for many traits. Yet it 
should be possible to utilize it for certain socially acceptable or 
socially desirable traits, such as congeniality, sociability, coopera- 
tion, and perhaps even dominance and initiative. 

The utilization of this general strategy would require some pro- 
cedure for determining the degree to which the experimenter was 
justified in his assumption that the conditions were essentially 
comparable for all subjects and that the test performances were 
determined almost exclusively by some one trait. Some technique 
for assessing homogeneity, such as scalogram analysis, should be 
employed. 

Furthermore, it must be recognized that this strategy involves 
rather artificial conditions, atypical of everyday life outside the 
testing room. While it is of considerable practical importance to 
know the maximum intellectual potential of an individual, the ex-
perimenter measuring personality is, of course, much more con-
cerned with estimating the typical strength of a personality disposi-
tion than its maximal value or its strength under such atypical
conditions as the measurement situation. Moreover, under normal
circumstances, different traits and different motives are often in
conflict with each other at any one point in time, so that their rela-
tive strength may be of greater interest than their absolute strength
when measured in isolation from any other type of stimulation.

This first strategy aims at the elimination of all effective stimuli 
except a set selected to elicit reactions of a particular type. The 
alternative strategy is more realistic but may be less feasible. It 
involves seeing what personality dispositions emerge when subjects
are confronted with relatively neutral stimuli, or at least with 
stimuli which do not elicit the same disposition in each subject. 
Projective techniques will serve as an illustration of this orienta- 
tion. They provide rather ambiguous stimuli in a rather unstruc- 
tured situation and allow a diversity of traits to emerge. In this 
strategy, the disposition is generated or produced mostly by the 
organism with the stimulus playing a minor role, in contrast to the 
other approach which seeks to control the kind of disposition 
elicited, with the strength being determined by the characteristics 
of the subject. 

The experimenter following this strategy must settle initially 
for classifications of subjects, rather than for an ordering along a
single continuum. Having collected the subjects’ responses to a 
stimulus, he groups the persons in terms of the way in which they 
interact with it, perhaps in terms of the motives it elicited. By care- 
ful planning and by trying out a variety of stimuli, the experi- 
menter may be able to find a set of stimuli, each one of which 
yields responses that can readily and exhaustively be categorized 
into useful classes. In order to obtain dependable data, he must 
still cope with the problem of reducing the categorizations for each 
stimulus into a single classification for the testing instrument as a
whole, or must follow the more difficult alternative of developing 
a scheme for classifying total protocols. 

This strategy has the advantage of being closer to the normal
functioning of subjects. Also, it has the promising possibilities asso- 
ciated with scaling the categories along one or several dimensions. 
On the other hand, the extensive efforts to deal with such content 
as projective protocols have met with only limited success from a
psychometric point of view. Even if the classification can be made 
objective and still be psychologically significant, categorical data 
are more laborious to analyze than scores forming an ordered series. 
In addition, responses to unstructured stimuli and situations tend 
to be less consistent over time, so that the stable classification of 
subjects would require repeated testings. 

Since the objective assessment of personality is, relatively speak- 
ing, in its infancy, there appear to be a number of problems for 
which this second strategy provides an appropriate approach for 
psychologists on the contemporary scene. This view is supported by 
the fact that the alternative approach of measuring single variables
under fully controlled conditions cannot be fully implemented to- 
day for most aspects of personality. 

Furthermore, the two strategies can be used in sequence. Pro-
tocols of behavior in relatively free, non-test situations (such as a
therapy hour) can be analyzed to determine the frequencies of 
various categories. From the degree of association between the pro- 
files or patterns of relative frequencies for pairs of samples, factors 
indicating clusters of samples can be obtained (Butler, Rice & Wag- 
staff, in press). But from examination of the behaviors character- 
istic of each factor, broad behavioral variables can be identified, 
on which persons can be ordered. Thus delineations of classes of 
individuals may be used for the important purpose of uncovering 
constructs referring to more fundamental dimensions than the 
loosely conceived common traits so prevalent today. 

In the long run, the measurement of individual differences in per- 
sonality must move in the direction of the conditions utilized for 
measuring ability, but it must go further than any experimenter 
does today. We must develop tasks, each with instructions which 
are sufficiently unambiguous so that all subjects will seek to maxi- 
mize the same thing, whether or not the thing maximized is what 
the experimenter actually counts to obtain a score. The result will 
be scores which are stable for each individual and comparable 
across individuals. Only with such measurements can we hope to 
establish dependable findings which can serve as grist for the theo-
retician’s mill. We will attain success most readily in the areas of
cognitive practices, perceptual dispositions, and work habits. It will 
be some time before we can handle the problems in the area of 
motivation. 


Summary 


Any procedure for determining individual differences should be
viewed as an experiment. Psychology has developed a common set
of experimental conditions for assessing ability. However, the meas-
urement of personality ordinarily utilizes a different set of condi-
tions which have made it extremely difficult to obtain adequate
scores comparable across individuals.

No ready solution to this problem is apparent. On the one hand, 
the experimenter can seek to refine or control the conditions under
which he attempts to measure personality so that the resulting
scores are determined to a minimal extent by factors other than 
the variable in which he is interested. But such scores may not be 
clearly representative of subjects’ behavior outside the test situa-
tion. On the other hand, he can employ a relatively free and more
lifelike situation and classify individuals by quality or type of
response.

Further investigation and analysis of the experimental condi-
tions for measuring individual differences is required. Sophisticated
technical analyses cannot take the place of adequate specification
and control of the conditions necessary for measurements having
theoretical and practical value.


REFERENCES 


Anastasi, Anne. “The Nature of Psychological Traits.” Psychologi-
cal Review, LV (1948), 127-138.
Atkinson, J. W. “Personality Dynamics.” Annual Review of Psy-
chology, XI (1960), 255-290.
Bergman, G. and Spence, K. W. “The Logic of Psychophysical
Measurement.” Psychological Review, LI (1944), 1-24.
Brunswik, E. Systematic and Representative Design in Psychologi-
cal Experiments. Berkeley: University of California Press, 1947.
Butler, J. M. “Measuring the Effectiveness of Counseling and Psy-
chotherapy.” Personnel and Guidance Journal, 1953, October,
Butler, J. M. and Fiske, D. W. “Theory and Techniques of Assess-
ment.” Annual Review of Psychology, VI (1955), 327-356.
Butler, J. M., Rice, Laura N., and Wagstaff, Alice. “On the Natu-
ralistic Definition of Variables: An Analogue of Clinical Analy-
sis.” In H. H. Strupp and L. Luborsky (Editors), Research in
Psychotherapy. Volume II. Washington, D. C.: American Psy-
chological Association, 1962.
Butler, J. M., Wagstaff, Alice K., and Rice, Laura N. “Naturalistic
Observation and Research.” Chicago: University of Chicago
Counseling Center Discussion Paper, VI (1960), No. 17.
Campbell, D. T. “A Typology of Tests, Projective and Otherwise.”
Journal of Consulting Psychology, XXI (1957), 207-210.
Campbell, N. R. An Account of the Principles of Measurement and
Calculation. London: Longmans, Green and Co., Ltd., 1928.
Cattell, R. B. Description and Measurement of Personality. Yon-
kers-on-Hudson: World Book Company, 1946.
Coombs, C. H. “The Scale Grid: Some Interrelations of Data
Models.” Psychometrika, XXI (1956), 313-329.
Coombs, C. H. “A Theory of Data.” Psychological Review,
(1960), 143-159.
Cronbach, L. J. “Response Sets and Test Validity.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, VI (1946), 475-494.

Cronbach, L. J. “Further Evidence on Response Sets and Test De- 
sign.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, X 
(1950), 3-31. 

Cronbach, L. J. “The Two Disciplines of Scientific Psychology.” 
American Psychologist, XII (1957), 671-684. 

Duffy, Elizabeth. “The Psychological Significance of the Concept 
of ‘Arousal’ or ‘Activation’.” Psychological Review, LXIV
(1957), 265-275. 

Edwards, W. “Costs and Payoffs are Instructions.” Psychological 
Review, LXVIII (1961), 275-284. 

Ferguson, G. A. “On Learning and Human Ability.” Canadian Jour-
nal of Psychology, VIII (1954), 92-112.

Fiske, D. W. “The Inherent Variability of Behavior.” In D. Fiske
and S. Maddi (Editors), Functions of Varied Experience. Home-
wood, Ill.: Dorsey Press, 1961.

Goodenough, Florence. “The Appraisal of Child Personality.” Psy- 
chological Review, LVI (1949), 123-131. 

Guilford, J. P. Personality. New York: McGraw-Hill, 1959.

Gulliksen, H. Theory of Mental Tests. New York: John Wiley & 
Sons, 1950. 

Hebb, D. O. “Drives and the C. N. S. (Conceptual Nervous Sys-
tem).” Psychological Review, LXII (1955), 243-254.

Krathwohl, D. R. and Cronbach, L. J. “Suggestions Regarding a
Possible Measure of Personality: The Squares Test.” EDUCA-
TIONAL AND PSYCHOLOGICAL MEASUREMENT, XVI (1956), 305-

Levinson, D. J. “A Note on the Similarities and Differences between
Projective Tests and Ability Tests.” Psychological Review, LIII
(1946), 189-194.
McClelland, D. C., Atkinson, J. W., Clark, R. A., and Lowell, E. L. 
The Achievement Motive. New York: Appleton-Century, 1953. 
McQuitty, L. L. “Conditions Affecting the Validity of Personality
Inventories, I.” Journal of Social Psychology, XV (1942), 33-

McQuitty, L. L. “Typal Analysis.” EDUCATIONAL AND PSYCHOLOGI-
CAL MEASUREMENT, XXI (1961), 677-696.
Malmo, R. B. “Activation: A Neuropsychological Dimension.” Psy-
chological Review, LXVI (1959), 367-386.
Mollenkopf, W. G. “Time Limits and the Behavior of Test Takers.”
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, (1960), 203

Sarason, S. B. The Clinical Interaction: With Special Reference to
the Rorschach. New York: Harper and Brothers, 1954.

Sawyer, J. and Nosanchuk, T. A. “Analysis of Sociometric Struc-
ture: A Method of Successive Grouping.” In Proceedings of the
Social Statistics Section, American Statistical Association, 1960,


Schachtel, E. G. “Subjective Definitions of the Rorschach Test Sit-
uation and Their Effect upon Performance. Contributions to an
Understanding of Rorschach's Test. III.” Psychiatry, VIII
(1945), 419-448. 

Secord, P. F. “An Analysis of Perceptual and Related Processes 
Occurring in Projective Testing.” Journal of General Psychology, 
XLIX (1953), 65-85. 

Sinclair, Edith J. The Relation of Color- and Form-Dominance to 
PURA Unpublished Ph.D. thesis, University of Chicago, 1956.

Stevens, S. S. “Measurement, Psychophysics, and Utility.” In C. W.
Churchman and P. Ratoosh (Editors), Measurement: Defini- 
tions and Theories. New York: John Wiley & Sons, 1959. 

Thurstone, L. L. The Development of Objective Measures of Tem- 
perament. Chapel Hill: Psychometric Laboratory, University of 
North Carolina. April, 1953. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


RATINGS OF PROJECTIVE TEST PROTOCOLS AS A 
FUNCTION OF DEGREE OF INFERENCE¹


KENNETH I. HOWARD 
VA Hospital, Hines, Illinois and Stritch School of Medicine, Loyola University 


In a previous study it was concluded that ipsative ratings based
on Rorschach, TAT, and Stein Sentence Completion Test proto-
cols showed little convergent or discriminant validity (Howard,
1962). The present paper is concerned with an evaluation of these
projective instruments as potential sources of normative ratings to
be used in clinical prediction.

In the empirical prediction problem the investigator collects a
large number of measures on a group of individuals and assesses 
the relation of each to the criterion behavior. Two salient require- 
ments of these measures are that they be sensitive to individual 
differences so that individuals can be differentiated along the dimen- 
sions concerned, and that they be relatively reliable. Projective 
instruments provide a rich source of variables which may be per- 
tinent to prediction in the clinical setting. 

Normative rating refers to the assignment of a position on a trait
dimension relative to the individual’s position in the parent popula-
tion. An example would be a rating of intelligence where each posi-
tion on the dimension is equivalent to a percentile position. Norma-
tive ratings which are based on projective test protocols can vary
in degree of inference from the observed behavior. At the more
inferential end of the continuum are ratings which involve esti-
mates of complex personality functions, e.g., ego strength. At the
more objective end of the continuum are ratings which involve
estimates of behavior of which the test performance provides a
sample, e.g., verbal organization. The purpose of this study is to
investigate the reliability and validity of normative ratings based
on projective test protocols over a range of required rater infer-
ence from test performance.

¹ Extended version of paper presented at the 1961 meetings of the American
Psychological Association. The author wishes to express his warm appreciation
for the critical assistance of J. C. Stanley, D. W. Fiske, D. T. Campbell,
T. A. Vernon, and H. Diesenhaus.


Method 


Five traits were chosen to cover a range of objectivity. Seven 
clinical psychologists rated each of ten subjects on each trait. The 
ratings were based on Rorschach, TAT, and Stein Sentence Com-
pletion Test (SSCT) protocols taken separately. Each rater, there-
fore, rated each subject on the basis of each of the three tests, yield-
ing twenty-one independent ratings for each subject on each trait.

For each trait, rater agreement within a test was taken as an 
estimate of the reliability of the ratings, i.e., each pair of raters
was correlated for each test separately. Rater agreement across tests 
was taken as an estimate of the validity of the ratings. An example 
of a validity correlation would be that between Rater A rating 
from the Rorschach and Rater B rating from the TAT for the same 
trait. In this case ratings based on one test were validated against 
ratings based on each other test. 
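
The pairing scheme just described can be laid out explicitly. The following sketch only enumerates which pairs of rating sources would count toward reliability and which toward validity for one trait; it computes no correlations and uses no data from the study. The rater labels A through G are placeholders, and the exclusion of same-rater pairs across tests is an assumption, made because the example in the text involves two different raters.

```python
from itertools import combinations

# Placeholder labels for the seven raters; the three tests are those named in the text.
RATERS = ["A", "B", "C", "D", "E", "F", "G"]
TESTS = ["Rorschach", "TAT", "SSCT"]

# Each (rater, test) combination is one source of ratings for a given trait.
sources = [(rater, test) for test in TESTS for rater in RATERS]

reliability_pairs = []  # different raters, same test
validity_pairs = []     # different raters, different tests

for (rater_1, test_1), (rater_2, test_2) in combinations(sources, 2):
    if rater_1 == rater_2:
        # Assumption: a rater is not paired with himself across tests.
        continue
    if test_1 == test_2:
        reliability_pairs.append(((rater_1, test_1), (rater_2, test_2)))
    else:
        validity_pairs.append(((rater_1, test_1), (rater_2, test_2)))

print(len(sources), len(reliability_pairs), len(validity_pairs))
```

Under these assumptions the twenty-one sources yield, for each trait, 63 within-test pairs bearing on reliability and 126 across-test pairs bearing on validity.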


Subjects 


The subjects were patients who had been referred to the Psychol-
ogy Service of a Veterans Administration Hospital for psychologi-
cal testing. All were male veterans. The test protocols had been
recorded at least eight years prior to the beginning of this study 
so there was little possibility that any rater would have acquaint- 
ance with any subject. There were ten subjects in the sample— 
four diagnosed as Schizophrenic, three diagnosed as Neurotic, and 
three diagnosed as Inadequate Personality. The raters were not 
informed as to the diagnostic makeup of the sample. 


Raters 


The seven raters were all members of the Psychology Service of 
the same hospital. They ranged in clinical experience from Ww 
to fifteen years. Four of the seven raters had Ph.D. degrees in Clinical
Psychology while the remaining three were fourth-year graduate
interns in the clinical psychology training program.


Tests 


The protocols of the three tests for each subject were used as 
the information to be rated. All protocols were typed and coded 
in order to minimize irrelevant cues which would indicate which 
protocols belonged to the same subject. The Rorschach protocols 
ranged from 13 to 43 responses with a mean of 30.5 responses per 
record. The TAT protocols contained only the stories based on 
cards 1, 2, 3, 6, 7, 12, and 13. The SSCT consisted of the standard 
100 sentence stems. 

The protocols for a particular test for all ten subjects made up 
one packet. A rater was given one packet to complete and when 
finished with his ratings was given the next packet. The subjects 
were in the same order for each rater, but in a different order for 
each test. The order of presentation was SSCT, TAT, and Rorschach.


Ratings 


The task of the rater was to rate each protocol on five ten-point 
rating scales as part of a larger rating task. The traits to be rated
were Psychotherapy Prognosis, Adjustment, Intelligence, Productivity,
and Verbal Fluency. These traits range respectively from a
high degree of inference from test performance to a low degree.
(The ordering of the traits along the objective-inferential continuum 
was done independently by a group of clinical raters.) 


Results 


The first analysis attempted to evaluate the main and interaction
effects in the entire matrix of ratings. In order to have a balanced
matrix, one Ph.D. rater and one schizophrenic subject were eliminated.
This left us with a double-nested six classification matrix
consisting of three raters within each of two experience groups
(pre-Ph.D. and post-Ph.D.), three subjects within each of three
diagnostic groups, three tests, and five traits.

Table 1 shows the results of the six-way analysis of variance
(Stanley, 1961a) which was based on the matrix of ratings. The
important effect here was the significant subject × trait interaction,
which indicated that the subjects were being discriminated from one
another on the trait dimensions. The significant subject × trait ×
test interaction, which results from some subjects receiving higher
ratings on a trait on one test than on another test, represents an
effect which would serve to lower validity (the correlation across
tests), as would the subject × test interaction. The subject ×
rater and subject × rater × test interactions represent effects which
would serve to lower reliability (the correlation within tests).


TABLE 1


Summary Table of Analysis of Variance of Subjects × Raters ×
Tests × Traits × Diagnosis × Experience


                                            F Test
Source                SS   df     MS  Denominator      F  p
Subjects (s)      547.98    6  91.33        24.79   3.68  <.05
Raters (r)         31.15    4   7.79        12.62   0.62  NS
Tests (o)          25.83    2   12.9        35.92   0.36  NS
Traits (t)        203.97    4  50.99        18.60   2.74  <.05
Diagnosis (d)     111.35    2  55.68        96.02   0.58  NS
Experience (e)      1.98    1   1.98         3.58   0.55  NS
s × r             245.00   24  10.21         1.61   6.34  <.05
s × o             472.68   12  39.39        10.63   3.71  <.05
s × t             388.63   24  16.19         1.61  10.06  <.05
s × e              59.06    6   9.84        10.38   0.91  NS
r × o              41.55    8   5.19         7.87   0.66  NS
r × t              64.34   16   4.02         1.61   2.50  <.05
r × d              66.39    8   8.30        11.25   0.74  NS
o × t              40.66    8   5.08         6.20   0.82  NS
o × d              35.42    4   8.86        33.01   0.27  NS
o × e              17.85    2   8.92        77.63   0.11  NS
t × d             196.37    8  24.55        25.26   0.97  NS
t × e               3.22    4   0.80         4.64   0.17  NS
d × e              10.61    2   5.30         5.96   0.89  NS
s × r × o         361.94   48   7.54         2.78   2.71  <.05
s × r × t         154.59   96   1.61         2.78   0.58  NS
s × t × o         281.00   48   5.87         2.78   2.11  <.05
r × t × o          99.54   32   3.11         2.78   1.12  NS
s × e × t          53.51   24   2.23         1.61   1.38  NS
s × e × o         973.51   12  81.13         8.03  10.10  <.05
r × d × t          84.78   32   2.65         1.61   1.38  NS
r × d × o          65.80   16   4.11         7.08   0.58  NS
d × e × t          10.48    8   1.81         3.27   0.40  NS
e × d × o          29.71    4   7.43        19.58   0.09  NS
d × o × t          39.42   16   2.46         5.41   0.45  NS
e × o × t          16.92    8   2.12         3.60   0.59  NS
s × e × t × o     157.13   48   3.27         2.78   1.18  NS
o × t × d × r     148.25   64   2.32         2.78   0.83  NS
d × e × t × o      74.98   16   4.69         2.81   1.67  NS
s × r × o × t     533.47  192   2.78

Total            3493.02  809   4.32
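As a reading aid for Table 1, the sketch below shows how the columns relate: each mean square is SS divided by df, and each F is that mean square divided by the value in the "F Test Denominator" column. The three rows are copied from the table; the function name `f_ratio` and the dictionary layout are illustrative assumptions only.

```python
def f_ratio(ss, df, denominator_ms):
    """Return (mean square, F) for one row of an ANOVA summary table."""
    ms = ss / df
    return ms, ms / denominator_ms


# (SS, df, F-test denominator) for three rows, as printed in Table 1.
rows = {
    "Subjects (s)": (547.98, 6, 24.79),
    "s x r": (245.00, 24, 1.61),
    "s x t": (388.63, 24, 1.61),
}

results = {name: f_ratio(*values) for name, values in rows.items()}
for name, (ms, f) in results.items():
    print(f"{name:<14} MS = {ms:6.2f}   F = {f:5.2f}")
```

For these three rows the recomputed values agree with the printed MS and F columns to two decimals.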




TABLE 2


F Values from Three-Way Analyses of Variance of Ratings of Psychotherapy
Prognosis (PP), Adjustment (A), Intelligence (I), Productivity (P),
and Verbal Fluency (VF)


Source          PP       A        I        P        VF       df
Subject (s)    3.42**   4.69**  12.24**  13.55**  17.36**   9/54
Rater (r)       .72      .97     2.89*    4.85**   8.12*    6/54
Test (t)        .37      .46      .10     1.44      .29     2/108
(s) × (r)      1.17     1.66*    1.01     1.86     1.24    54/108
(s) × (t)      2.25**   4.29**   3.99**   7.74**   9.86**  18/108
(r) × (t)       .77      .61     1.45     1.28      .81    12/108
Variance       2.76     2.97     3.30     4.95     5.26

* p < .05

** p < .01


The remaining analysis was patterned after that proposed by 
Stanley (1961b). Using this system, three-way analyses of variance 
were first computed for each trait. From these analyses of variance, 
estimates of pertinent mean correlations were obtained. 

Table 2 illustrates the F values for each of the five analyses of 
variance. As this table indicates, each trait showed significant differ- 
ences between subjects. The subject effect tended to increase in 
size as the traits became more objective (less inferential). All sub- 
ject-test interactions were significant and also showed the general 
trend of increasing as the traits became more objective. Intelligence, 
Productivity, and Verbal Fluency ratings yielded significant rater 
main effects. 

Table 3 gives the mean correlations for each trait. Average rater 
agreement (reliability) showed a systematic increase as the traits 
became more objective. The average correlation between tests 


TABLE 3


Estimated Mean Correlations Over Subjects for All
Possible Comparisons of Ratings


Comparison                       PP*    A     I     P     VF
Between raters, same test        .19   .33   .45   .56   .63
Between raters, between tests    .06   .07   .22   .19   .19
Between tests, same rater        .10   .19   .22   .24   .22
Between raters, sum of tests     .26   .35   .62   .64   .70


* PP = Psychotherapy Prognosis
A = Adjustment
I = Intelligence
P = Productivity
VF = Verbal Fluency




(validities) was low and did not show this trend. The average of 
the correlations between subjects was .02. 

The next analysis was directed at ascertaining whether some of 
the traits were rated better (more reliably) on the basis of one 
test than another. A two-way analysis of variance was computed 
for each test-trait combination using the components of variance 
model without replication (McNemar, 1957, pp. 308-309, Model 
VI). Each of these fifteen analyses of variance involved 70 ratings 
—seven raters by ten subjects—and yielded the main effects of sub- 
ject and rater as well as the subject-rater interaction. All subject 
effects were significant beyond the .05 level. 

Reliability (inter-judge agreement) can be defined as the pro- 
portion of between subject variance to total variance (Tryon, 1957). 
In this case the effect of rater had no influence on the reliability, 
but merely reflected the average level of ratings. The rater effect, 
however, did contribute to the total variance. Consequently, rater 
variance was subtracted from the total variance. The coefficients re- 
ported in Table 4 represent the ratio of the between subject vari- 
ance to the total variance minus the between rater variance. 
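One way to make this coefficient concrete is sketched below. The text does not spell out the estimator, so the sketch assumes variance components from a two-way subjects-by-raters analysis with one rating per cell: the between-subject component divided by the sum of the subject and residual components, which leaves the rater component out of the denominator. The function name and both data matrices are hypothetical.

```python
def interjudge_agreement(matrix):
    """matrix[i][j] = rating of subject i by rater j (one rating per cell)."""
    n = len(matrix)       # number of subjects
    k = len(matrix[0])    # number of raters
    grand = sum(sum(row) for row in matrix) / (n * k)
    subj_means = [sum(row) / k for row in matrix]
    rater_means = [sum(row[j] for row in matrix) / n for j in range(k)]

    # Sums of squares for the two-way layout without replication.
    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_rater = n * sum((m - grand) ** 2 for m in rater_means)
    ss_error = ss_total - ss_subj - ss_rater

    ms_subj = ss_subj / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Assumed variance-component estimates; the rater component is excluded.
    var_subj = max((ms_subj - ms_error) / k, 0.0)
    var_error = max(ms_error, 0.0)
    total = var_subj + var_error
    return var_subj / total if total > 0 else 0.0


# Hypothetical data: raters differ only by a constant offset (pure rater effect).
offset_only = [[1, 3], [2, 4], [3, 5], [4, 6]]
# Hypothetical data: raters disagree on the ordering of some subjects.
noisy = [[1, 2], [2, 1], [3, 4], [4, 3]]

agreement_offset = interjudge_agreement(offset_only)
agreement_noisy = interjudge_agreement(noisy)
print(round(agreement_offset, 2), round(agreement_noisy, 2))
```

The first matrix illustrates the point made in the text: a rater who is uniformly more lenient changes the average level of the ratings but does not lower the coefficient.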

No attempt was made to assess the significance of differences 
between these correlations. It is suggestive that ratings of Adjust- 
ment based on the Rorschach were more reliable than those based 
on the other tests and that the Rorschach provided the least reliable 
ratings for Intelligence and Verbal Fluency. The TAT yielded the 
most reliable ratings of Productivity and Verbal Fluency. Ratings 
of Intelligence from the TAT and SSCT were equally reliable. The 
ratings of Psychotherapy Prognosis from each test were uniformly 
low. Each test, however, showed the already noted trend of increas- 
ing reliability with increasing objectivity of the traits. 


TABLE 4


Estimated Average Inter-Judge Agreement for Ratings
of Each Trait from Each Test


Test         PP*    A     I     P     VF
SSCT         .27   .45   .59   .51   .69
TAT          .35   .23   .58   .72   .74
Rorschach    .93   .57   .35   .64   .57


* PP = Psychotherapy Prognosis
A = Adjustment
I = Intelligence
P = Productivity
VF = Verbal Fluency


Discussion 

Rater agreement is dependent on two effects: (a) the main effect 
of subject, and (b) subject-rater interaction. Inspection of Table 
2 indicated that the subject-rater interaction effect was in general 
not significant. Consequently, rater agreement was mainly a func- 
tion of the ratio of between subject variance to error variance— 
the higher this ratio the higher the reliability of the ratings. The 
results showed that rater agreement varied as a function of the
objectivity of the trait—the more objective the trait, the more 
rater agreement. The two most objective traits (Productivity and 
Verbal Fluency) had an average reliability of .67 while the two 
most inferential traits (Psychotherapy Prognosis and Adjustment) 
had an average reliability of .31. 
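The two averages quoted here appear to correspond to the sum-of-tests row of Table 3 (correlations between raters when ratings are summed over the three tests); that reading is an inference from the arithmetic, not something the text states. A short check, with the dictionary and function names chosen only for illustration:

```python
# Between-rater correlations for ratings summed over tests, as printed in Table 3.
sum_of_tests = {"PP": 0.26, "A": 0.35, "I": 0.62, "P": 0.64, "VF": 0.70}


def mean_of(traits):
    """Average the tabled correlations for the named traits."""
    return sum(sum_of_tests[t] for t in traits) / len(traits)


objective = mean_of(["P", "VF"])      # two most objective traits
inferential = mean_of(["PP", "A"])    # two most inferential traits
print(f"{objective:.3f} {inferential:.3f}")
```

The second figure comes out at .305, which matches the reported .31 after rounding.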

The agreement between tests (validity) is dependent on two 
effects: (a) the main effect of subject, and (b) test-subject inter- 
action. The test-subject interaction showed a general increase in 
strength as a function of the objectivity of the trait—the more 
objective the trait, the more dependent was the rating on the per- 
formance on the particular test. The tests provided different stimu- 
lus situations and consequently different performances. As a result, 
the increase in reliability of the more objective ratings was not 
reflected in the test agreement. The average test agreement for the 
two most objective traits was .30 while the average agreement for 
the two most inferential traits was .21. 

Guilford (1954, pp. 281-288) has provided a method of correcting 
for the test-subject interaction effect under the general rubric of 
correcting for bias in ratings. Stanley (1961b) has also proposed a
method of correcting for systematic biasing effects. Using these
methods the test agreement correlations could be increased. It can
be seen, though, that the correlations for the more inferential traits
were quite close to the rater agreement correlations. This indicates
that the test agreement correlations could not be appreciably
increased for these ratings as compared with the more objective
ratings.

The trait Intelligence showed a slightly different picture. This
was the only trait to show respectable rater agreement and test
agreement (.62 and .41, respectively). It can be seen that Intelligence
ratings showed relatively little test-subject interaction,
indicating that estimates of Intelligence were not strongly dependent
on the stimulus determinants of the particular performance. Raters 
were able to reliably rate this trait and the trait showed some gen- 
erality across the three test performances. The results of the analy- 
sis of the Intelligence ratings suggest that there may be some op- 
timum position in the middle of the objective-to-inferential 
continuum which may yield reliable and valid ratings. 

Some brief comments on the main effect of rater are in order. In- 
spection of Table 2 indicated that the most inferential ratings 
showed no rater effect, i.e., the raters tended to use the same range 
on these scales. Intelligence, Productivity, and Verbal Fluency 
ratings did show rater effects, indicating significant differences be- 
tween the average ratings of each rater. For these ratings some 
raters seemed to impose different standards on each scale. It should 
be noted, however, that this effect would have little influence on 
rater or test agreement correlations. 

One of the problems in empirical prediction is the selection of 
appropriate and meaningful variables. It is always tempting to 
translate the richness of projective protocols into numerical esti- 
mates of clinical concepts like Adjustment or Psychotherapy Prog- 
nosis. This study indicates that the further one moves from the test 
performance itself, the poorer the reliability of the estimates, and 
that these estimates will show little generalization across tests. It 
is not surprising that some individuals respond to one test better 
than to another (test-subject interaction). What is surprising is 
that ratings of Adjustment, for example, on one test have no relation 
(.07) to ratings of the same concept based on another test. 

The analyses of each test-trait combination yielded differences 
in rater reliability across tests. It is noteworthy, however, that one 
of the biggest differences occurs for the ratings of Intelligence—the 
trait showing the best validity. 


Summary and Conclusions 


The purpose of this study was to investigate the reliability and 
validity of ratings of projective test protocols as a function of the 
objectivity of the traits rated. Seven clinical psychologists rated 
ten subjects on five traits. The ratings were based on Rorschach, 
TAT, and Stein Sentence Completion Test protocols. The traits 
were Psychotherapy Prognosis, Adjustment, Intelligence, Productivity,
and Verbal Fluency. These traits ranged from highly inferential
to fairly objective in the order given. The ratings on each
trait were analyzed in a series of analyses of variance and correla- 
tions were computed using a system devised by Stanley (1961b). 

All traits showed significant differentiation of the subjects. The 
strength of this discrimination increased as the traits became more 
objective. Rater agreement showed the same trend and ranged 
from .19 for Psychotherapy Prognosis to .63 for Verbal Fluency. 
Agreement between tests (validity) was in general low and did 
not show this trend. The absence of trend appeared to be due to 
the relatively stronger test-subject interaction effects in the more 
objective ratings. As the trait became more objective, the ratings 
became more dependent on the specific performance called for by 
the test. Ratings of Intelligence had an average rater agreement of 
.62 and an average test agreement of .41. This finding indicated that,
along the inferential-objective continuum, there may be an optimum 
point at which normative ratings from projective test protocols 
would be both reliable and valid. 

Inter-judge agreement was estimated separately for each test for 
each trait. It was noted that the Rorschach yielded the most reli- 
able ratings of Adjustment and the least reliable ratings of Intelli- 
gence and Verbal Fluency. The TAT yielded the most reliable 
ratings of Productivity. All ratings of Psychotherapy Prognosis 
showed low reliability. 


REFERENCES 


Guilford, J. P. Psychometric Methods (Second Edition). New York:
McGraw-Hill Book Company, 1954.

Howard, K. I. “The Convergent and Discriminant Validation of
Ipsative Ratings from Three Projective Instruments.” Journal
of Clinical Psychology, XVII (1962), 183-188.

McNemar, Q. Psychological Statistics. New York: John Wiley &
Sons, 1955.

Stanley, J. C. “Analysis of a Doubly Nested Design.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, XXI (1961), 831-837. (a)

Stanley, J. C. “Analysis of Unreplicated Three-Way Classifications,
with Applications to Rater Bias and Trait Independence.”
Psychometrika, XXVI (1961), 205-220. (b)

Tryon, R. C. “Reliability and Behavior Domain Validity: Reformulation
and Historical Critique.” Psychological Bulletin,
(1957), 229-249.




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


PREDICTION OF PROFICIENCY IN A MODERN AND 
TRADITIONAL COURSE IN BEGINNING ALGEBRA 


H. G. OSBURN 
University of Houston 
AND 
R. S. MELTON


The Psychological Corporation 


Introduction 


A small revolution in the teaching of secondary school mathe- 
matics is currently underway. The major impetus for change stems 
from a widespread dissatisfaction among mathematicians and edu- 
cators with a mathematics curriculum that has remained largely 
unchanged for years, despite a mushrooming of new developments. 
As one writer points out, “. . . on one thing almost everyone agrees,
the old mathematics course must go. It does not tell the students
what mathematics is all about. It does not give them any real
understanding of the principles of the subject. It is so far behind the
times that it leaves out practically all of the new ideas and
discoveries of the past 100 years. And above all it has managed to make
mathematics about the most unpopular of all branches of learning. 
...” (Rosenbaum, 1958). 

There are several programs aimed at modernizing the teaching 
of high school mathematics. Foremost among these is that of the 
Commission on Mathematics of The College Entrance Examination 
Board (Report of the Commission, 1959). Another is the University 
of Illinois program under the direction of Max Beberman (McCoy,
1959). A third, with which the present study is concerned, is the 
Developmental Project in Secondary Mathematics at Southern 
Illinois University (Kenner & Small, 1959). 




The course that has been developed under this project was designed
to present mathematics as a logical-deductive system, with
a more generalized and coherent approach to algebraic concepts.
By implication it should result in greater understanding and less
reliance on rules and procedures. Thus, it appeared plausible that
a different pattern of abilities might be called for on the part of the
student, and that some students might be better suited for such a
course than for the traditional course. As part of a cooperative
evaluation of this course by the writers and authors of the course 
materials, this hypothesis was subjected to investigation. 


Design 
Locale and Description of the Courses 


The study was conducted at Central High School, Cape Girar- 
deau, Missouri, during the 1959-60 school year. There were three 
experimental and three traditional classes of beginning algebra. 
Three teachers were involved.¹ One teacher taught one experimental
class, one taught two experimental and one traditional class, and 
the third taught two traditional classes. The first two teachers had 
participated in a four-day summer institute on the experimental 
text materials and were teaching the experimental course for the 
first time. 

The experimental course was a two-semester sequence in begin- 
ning algebra. The text, Elementary Concepts of Secondary School 
Mathematics (Kenner & Small, 1959), was divided into ten units: 
symbols and sets, the set of integers, graphs, the set of rationals, 
variable sentences in the set of rationals, number theory, algebraic 
number theory I—monomials, algebraic number theory II—tri- 
nomials, introduction to statistics, and mathematical systems. The 
text used definitions, axioms, and inductive demonstrations to de- 
velop the principles and theorems of algebra. The terminology and 
concepts of modern abstract algebra were used, and an attempt 
was made to give the student some notion of how the concepts used 
in algebraic operations are built up from more primitive concepts. 


1 Grateful acknowledgement is made for the assistance given to this study 
by Miss Grace Williams, Head of the Mathematics Department, and Mrs. 
Laura Rixman and Mr. Robert Ford, teachers of mathematics at Central High 
School, Cape Girardeau, Missouri. 




The course contained work on sets and inequalities, topics not in- 
cluded in the traditional course, and it emphasized graphing and 
number theory to a much greater extent than the traditional course. 

The text for the traditional course was A First Course in Algebra 
by Walter W. Hart (1947). It contains 18 chapters, as follows: 
formulas, simple equations, signed numbers and monomials, poly- 
nomials, equations of the first degree—one unknown, equations of 
the first degree—two unknowns, products and factoring—quadratic 
equations, algebraic fractions, indirect measurement, square root, 
quadratic equations, ratio—proportion—variation, and statistical 
graphs. In general, the text teaches algebraic operations by rules 
and demonstrations illustrating the rules. 


The Aptitude Measures 


Four aptitude batteries were administered to the subjects: the 
Iowa Algebra Aptitude Test, Revised Edition; the Orleans Algebra 
Prognosis Test, Revised Edition; the SRA Primary Mental Abil- 
ities, Third Edition, ages 11-17; and the Differential Aptitude 
Tests, Form A. The Iowa was administered to most of the students 
in the spring semester of the year preceding the study, the Orleans 
and the PMA were administered during the first two weeks of the 
fall semester, and the Differential Aptitude Test scores were avail- 
able from a statewide testing program carried out during the first 
weeks of the fall semester. 


The Proficiency Measures 


The writers, together with the authors of the experimental text, 
developed and administered three proficiency tests during the course 
of the year. In addition, the Cooperative Algebra Test, Form Z, 
and a final examination developed in conjunction with the three 
instructors were administered at the end of the school year. For the 
most part, the specially-developed proficiency tests (QR-1, QR-2, 
and QR-3) covered topics common to both the experimental and 
traditional courses, but comprehensive coverage was attempted only 
in the final examination. 

QR-1 (on negative numbers) was administered to all six classes 
on October 29, 1959, about one week after the experimental classes 
had completed the unit on the set of integers and one day after 
the traditional classes had completed their work on signed numbers.




(This test turned out to be highly speeded, and for this reason no 
reliability estimates were computed.) 

QR-2 (on equations) was administered to all six classes on Jan- 
uary 14, 1960, after both experimental and traditional classes had 
had considerable work on equations. The split-half reliability esti- 
mates for the various parts of this examination ranged from .21 
to .77 (median .59). The total score reliability for the experimental 
group was .90; for the traditional group, .83. 
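For readers unfamiliar with split-half estimates, a minimal sketch follows. The article does not say how the halves were formed or whether a step-up correction was applied; the odd-even split and the Spearman-Brown formula used here are common conventions and are assumptions, as are the function names and the item scores.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half(item_scores):
    """item_scores[p] is one student's list of item scores (1 = right, 0 = wrong).

    Returns (half-test correlation, Spearman-Brown estimate for the full test).
    """
    first = [sum(items[0::2]) for items in item_scores]   # items 1, 3, 5, ...
    second = [sum(items[1::2]) for items in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(first, second)
    r_full = 2 * r_half / (1 + r_half)  # assumed Spearman-Brown step-up
    return r_half, r_full


# Hypothetical responses of four students to a four-item test.
scores = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
]
r_half, r_full = split_half(scores)
print(round(r_half, 3), round(r_full, 3))
```

The step-up matters because each half is only half as long as the test actually given, so the raw half-test correlation understates the reliability of the full-length score.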

QR-3 was administered on March 30, 1960. Parts I and II of 
this test (on equations and number theory) were administered to 
both the experimental and traditional classes, but Part 3-E (on 
sets) was administered to experimental classes only and Part 3-T 
(on estimation) was administered to traditional classes only. Split- 
half reliability estimates for various parts of the test ranged from 
.65 to .86. The total score reliability for the experimental group
was .84; for the traditional group, .83. 

The final examination consisted of twenty multiple-choice items 
(final-common items) that were administered to both experimental 
and traditional classes and twenty multiple-choice items (final- 
specific items) that were specific to either the traditional or experi- 
mental course. Split-half reliability estimates for final-common 
items and final-specific items ranged from .60 to .84. The total score 
reliability for the experimental group was .89; for the traditional 
group, .77. 


Subjects 


Aptitude test data were obtained on 82 students in the three 
sections of the experimental course and on 73 students in the three 
traditional classes. The number of cases involved in the various 
analyses varied slightly due to dropouts during the year and ab- 
sences on the days the proficiency tests were administered. There 
also was a small number of losses due to failure to follow instructions
properly.

Students were assigned to the traditional and experimental classes 
without regard to ability, but an explicit random assignment pro- 
cedure was not feasible. The ratio of boys to girls varied slightly— 
the experimental group containing 43 per cent boys and the traditional
group 51 per cent boys. However, a comparison made between the 
students in the traditional and experimental classes with respect to
initial ability showed negligible differences between the two groups.
The experimental group tended to be slightly superior on the Iowa 
Algebra Prognosis subtests, the SRA Primary Mental Abilities and 
the Differential Aptitude subtests, while the traditional group 
tended to be slightly superior on the Orleans Algebra Prognosis 
subtests. Two of the mean differences on the latter were significant 
at the 5 per cent level and one was significant at the 1 per cent level; 
but all other mean differences were slight and not significant statis- 
tically. It was concluded that the two groups were reasonably com- 
parable with respect to initial ability. 


Results 
Validities of the Aptitude Tests 


As can be seen from Table 1, the Iowa Algebra Aptitude Test and 
the Orleans Algebra Prognosis Test showed substantial validities 
in predicting proficiency in both the experimental and traditional 
groups. Also, the aptitude tests correlated equally well with all 
types of proficiency tests, even though the QR-1, QR-2, and QR-3 
proficiency tests each involved a fairly restricted content while the 
Coop and final were omnibus tests. In general, the Primary Mental 
Abilities tests showed lower validities as compared with the Iowa 
and the Orleans tests, but again the pattern of correlations tended 
to be rather consistent across proficiency tests. One point of interest 
is that both the Space and Word Fluency tests showed consistently 
higher correlations for the experimental group, particularly for the 
QR series of proficiency tests. Surprisingly, Verbal Meaning and 
Reasoning were generally the best predictors, and Number was the 
least effective. 

As seen in Table 2, the Differential Aptitude Tests, with the 
exception of Clerical Speed and Accuracy, showed fairly substantial
validities. The best predictors, Verbal Reasoning, Numerical Ability,
and particularly the sum of these two tests (VR + NA), showed
validities that were of the same magnitude as those of the Iowa
and Orleans tests. The DAT Mechanical Reasoning test showed
consistently higher correlations within the experimental group; and
Space Relations, in line with the findings on the PMA Space test, 
also showed this trend, although the differences were not so large. 
On the other hand, the DAT Spelling test showed consistently 
higher validities for the traditional group. 


TABLE 1


Correlations Between the Iowa Algebra Aptitude Test, Orleans Algebra
Prognosis Test, and SRA Primary Mental Abilities Tests and Total
Scores on the Proficiency Tests

[Table body not recoverable from the source scan. Columns: IAAT Total,
OAPT Total, and PMA Verbal, Space, Reasoning, Number, and Word Fluency,
each for the experimental (E), traditional (T), and combined (C) groups.
Rows: QR-1 Total, QR-2 Total, QR-3 Total, Coop Total, Final Total.
Decimals omitted.]


TABLE 2


Correlations Between the Differential Aptitude Tests and the Total
Score on the Proficiency Tests

[Table body not recoverable from the source scan. Columns: Verbal
Reasoning, Numerical Ability, Abstract Reasoning, Space Relations,
Mechanical Reasoning, Clerical Speed and Accuracy, Spelling, Sentences,
and VR + NA, each for the experimental (E), traditional (T), and
combined (C) groups. Rows: QR-1 Total, QR-2 Total, QR-3 Total, Coop
Total, Final Total. Decimals omitted.]




A review of Tables 1 and 2 by columns reveals that there were 
no trends in the validity coefficients that could be attributed to the 
time interval between the administration of the ability battery and 
the various proficiency tests. As noted earlier, the validities of the 
various ability measures show a remarkable consistency across pro- 
ficiency measures in spite of the fact that the proficiency tests dif- 
fered in content and time of administration. 


Differential Validities of Certain Aptitude Tests 


As noted in the previous section, certain of the aptitude tests 
appeared to have noticeably higher validities for one or the other 
of the two types of courses; while the differences are not overpowering,
they do provide some measure of substantiation for the hypotheses
under test. There were six tests which gave evidence of
such differences: PMA Space, PMA Word Fluency, DAT Space
Relations, DAT Mechanical Reasoning, DAT Spelling, and Part 5 
of the Orleans Algebra Prognosis Test. Table 3 provides a complete 
listing of all the correlations between these six tests and eighteen 
part and total scores of the proficiency measures, for both the ex- 
perimental and traditional groups. While for the most part the 
differences in correlations were not statistically significant, there 
was a remarkable consistency, which is all the more notable in 
view of the difference in content of the various proficiency subtests. 

The spatial and mechanical reasoning tests—PMA Space, DAT 
Space Relations, and DAT Mechanical Reasoning—generally 
showed higher validities for the experimental group. The clearest 
evidence was in the case of the DAT Mechanical Reasoning test, 
which showed a higher validity for the experimental group on all 
eighteen proficiency measures, with a median validity of .35 for 
the experimental group and a median validity of .08 for the tradi- 
tional group. The PMA Space test showed higher validities on 
seventeen out of eighteen proficiency measures with a median valid- 
ity of .32 for the experimental group and .14 for the traditional 
group. The DAT Space Relations test showed a higher validity for 
the experimental group on thirteen out of eighteen proficiency 
measures, with a median validity of .48 for the experimental group 
as compared with .39 for the traditional group. 
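The article reports that most differences between the experimental and traditional correlations were not statistically significant but does not name the test used. One standard choice for two independent samples is Fisher's r-to-z transformation, sketched below as an assumption rather than as the authors' method. The inputs are illustrative: the two median validities quoted for DAT Mechanical Reasoning, and the group sizes for which aptitude data were obtained (82 and 73), although the number of cases in any single analysis varied slightly.

```python
from math import atanh, erf, sqrt


def fisher_z_difference(r1, n1, r2, n2):
    """Two-sided test of the difference between two independent correlations."""
    z1, z2 = atanh(r1), atanh(r2)                 # Fisher r-to-z transform
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))    # standard error of z1 - z2
    z = (z1 - z2) / se
    phi = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))   # standard normal CDF at |z|
    p = 2.0 * (1.0 - phi)
    return z, p


# Illustrative inputs only: median validities .35 and .08, group sizes 82 and 73.
z, p = fisher_z_difference(0.35, 82, 0.08, 73)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

With samples of this size even a gap as wide as .35 versus .08 falls short of the 5 per cent level under this test, which fits the remark that the individual differences were mostly not significant while their direction was consistent.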

TABLE 3


Correlations Between Six Ability Tests and the Proficiency Measures
by Experimental and Traditional Groups

[Table body not recoverable from the source scan. Rows: the eighteen
part and total scores of the proficiency measures, plus a median row.
Columns: experimental and traditional correlations for each of the six
ability tests named in the text.]


There was a consistent difference in validity in favor of the
experimental group on two other tests. The Orleans Algebra Prognosis
Test, Part 5, which involves substitution of a number for a
symbol in binomials and solving, showed a higher validity for the
experimental group on all eighteen proficiency measures, with a
median validity of .45 in the experimental group and a median
validity of .23 in the traditional classes. The PMA Word-Fluency
test showed a higher validity for the experimental group on fifteen
out of eighteen comparisons with a median validity of .37 for the
experimental group and .20 for the traditional group.

One test, DAT Spelling, showed consistently higher correlations 
for the traditional group in sixteen out of eighteen comparisons. 
The median validity was .46 as compared to a median validity of 
.82 in the experimental group. 

These findings clearly suggest the possibility of interaction be- 
tween ability and method of training. It is not clear from an analysis 
of the two courses why these six tests should have yielded the 
differential results, but, if subsequent investigations are able to 
confirm interactions of these kinds or others, they should be con- 
sidered carefully by mathematics educators in the development of 
new course materials and by secondary school authorities when 
implementing curriculum changes. It is quite possible that there 
is no one best method of teaching ninth grade algebra, and that 
maximum proficiency for the whole group may be attained by what 
Cronbach and Gleser (1957) call placement decisions. Rather than 
assign all students to one course regardless of aptitude pattern, it 
may be, to use a simple example suggested by the present findings, 
that students with high scores on spatial and mechanical reasoning 
tests would profit more from a less traditional course. Unfortu- 
nately, the present findings are only suggestive; considerable addi- 
tional information will be needed before firm recommendations can 
be made. 


Summary 
A battery of aptitude tests was administered to students in an
experimental modern ninth grade algebra course and students in 
a traditional ninth grade algebra course. Proficiency was measured 
at various points throughout the school year. The following results 
were obtained: 


(1) For the most part the aptitude tests were equally valid in 
predicting proficiency in either course. 




(2) The sum of the Verbal Reasoning and Numerical Ability 
scales from the Differential Aptitude Tests predicted proficiency 
in both courses with validity equal to that of the Iowa Algebra 
Aptitude Test and the Orleans Algebra Prognosis Test, tests 
designed specifically for this purpose. 

(3) There were no trends in the validities of the aptitude tests
as a function of the time interval between the administration
of the aptitude and proficiency measures.

(4) Spatial and mechanical reasoning tests were more valid for 
the experimental course than for the traditional course. One 
part of the Orleans Algebra Prognosis Test and the PMA 
Word-Fluency test gave similar results, while the DAT Spelling
test gave characteristically higher validities in the traditional
course. While no explanation was offered for these particular 
differential results, the potential significance of an interaction 
between aptitude patterns and method of teaching algebra was 
noted. 


REFERENCES 


Bennett, George K., Seashore, Harold G., and Wesman, Alexander
G. Manual for the Differential Aptitude Tests. New York: The
Psychological Corporation, 1959.

Cronbach, Lee J. and Gleser, Goldine C. Psychological Tests and
Personnel Decisions. Urbana: University of Illinois Press, 1957.

Greene, Harry A. and Piper, Alva. The Iowa Algebra Aptitude
Test: Examiner’s Manual. Bureau of Educational Research and
Services, State University of Iowa, Iowa City, 1942.

Hart, Walter W. A First Course in Algebra. Boston: D. C. Heath
and Company, 1947.

Kenner, Morton R. and Small, Dwain E. Elementary Concepts of
Secondary School Mathematics, Books I and II. Carbondale,
Illinois: Southern Illinois University, 1959.

Manual for the SRA Primary Mental Abilities. Chicago: Science
Research Associates, 1958.

McCoy, Eleanor M. “A Secondary School Mathematics Program.”
Bulletin of National Association of Secondary School Principals,
XLIII (1959), 12-18.

Orleans, Joseph B. Orleans Algebra Prognosis Test: Manual of
Directions. New York: World Book Company, 1951.

Program for College Preparatory Mathematics. Report of the
Commission on Mathematics, College Entrance Examination Board,
New York, 1959.

Rosenbaum, E. P. “The Teaching of Elementary Mathematics.”
Scientific American, CXCVIII (1958), 64-77.




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


FACTORIAL INVARIANCE AND OTHER PSYCHOMETRIC
CHARACTERISTICS OF FIVE OPINIONS ABOUT MENTAL
ILLNESS FACTORS¹


ELMER L. STRUENING 


Montrose V. A. Hospital 
AND 
JACOB COHEN 


New York University 


In an exploration of the mental illness domain, Cohen and Struening
(1962) identified five salient attitude dimensions underlying
opinions about severe mental illness and mental patients. These
dimensions were separately identified in the anonymous responses
of two samples of subjects to a set of 70 Likert-type opinion items
referent to the cause, description, treatment, and prognosis of
severe mental illness. The two samples were composed of personnel
from two large V. A. mental hospitals, one located in the northeast
(N = 541) and the other in the midwest (N = 653). Each sample
represented the complete spectrum of levels and functions of
personnel having frequent patient contact. The five factors were
identified through the method of multiple factor analysis (Thurstone,
1947), using the standard centroid extraction procedure. The five
factors were rotated to an orthogonal, simple structure solution by
means of the quartimax analytic method (Neuhaus & Wrigley,
1954). The factors were named Authoritarianism, Benevolence,
Mental Hygiene Ideology, Social Restrictiveness, and Interpersonal
Etiology. Fifty-one items with mental illness content were selected
to define the five concepts or factors. Each factor is operationally
defined by those items having high factor loadings on that
particular factor. This brief description of the development of the OMI,
more detailed in Cohen and Struening (1962), serves as background
for the present study.


¹ This work was completed at the Franklin D. Roosevelt Veterans
Administration Hospital, Montrose, New York as part of the Veterans
Administration Psychiatric Evaluation Project, Dr. Lee Gurel, Director.
Dr. Richard L. [surname illegible] was director of the project at the
inception of this study. We wish to thank the many Veterans
Administration personnel who participated in this study, particularly
the eleven project coordinators responsible for collecting data:
Drs. William Dobson, Roy Eck, Herman Efron, Gloria Fischer,
[first name illegible] Gordon, Ernest Kurtz, William Morris, Esther
Toms, Leonard Ullman, [first name illegible] Vernallis, and Robert
Walker. We wish to express our gratitude to the Watson Scientific
Computing Laboratory, New York City, for giving us access to their
facilities. Finally we wish to thank Mrs. Catherine S. [surname
illegible], the project secretary, for her signal contribution and our
research assistants, Miss Ethel Haas, Mr. Richard Gowin, and
Mr. William H. [surname illegible], for careful processing of the data.


Factorial Invariance across Three Samples
Definition


Factorial invariance, with similar variables and differing samples,
is defined as the degree of similarity of the factor loading
pattern of a selected set of variables on a particular factor
identified in the samples in question. This is tantamount to saying that
when a concept is operationally defined by a composite of variables,
the variables forming the composite play, within chance limits, the
same role in defining the concept in a variety of samples.


Procedure 


To investigate the factorial invariance of the 51 opinion items
defining the five OMI factors, the scale was anonymously
administered to the personnel of ten additional V. A. mental hospitals
throughout the United States. Three of the ten hospitals were selected
so as to maximize regional and subcultural differences and to
include the greatest variety of hospital types in terms of staff-patient
ratio, architectural structure, and size or average daily patient
number. This selection procedure was designed to provide three very
different samples on which to study the degree of factorial
invariance. For each hospital a representative sample of 400 respondents
was selected.

The three hospital settings where the personnel were sampled are
as follows:




Hospital III²: A small (400 bed), high staff (more staff than
patients) hospital located in a metropolitan area in the Rocky
Mountain region. Eighty per cent of the personnel of this hospital
completed the 51 items of the OMI.

Hospital IV: A medium sized (900 bed), low staff (more
patients than staff) hospital located near a metropolitan area on
the west coast. Ninety-five per cent of the personnel completed
the OMI.

Hospital V: A large (2000 bed), low staff (five patients to three
staff members) hospital on the edge of a small city in southeastern
United States. Eighty-six per cent of the personnel completed the
OMI.

Differences in religious preference are presented in Table 1,
indicating that the three samples come from markedly different
populations with respect to religious background. The average age
reflected by the three samples is 38.8 years for Hospital III, 42.9
for Hospital IV, and 41.2 for Hospital V. Average years of formal
education range from 13.0 at Hospital III to 11.3 at Hospital V,
with 12.4 at Hospital IV. Hospital V has the greatest
number of long term employees, while Hospitals IV and III rate
second and third on this characteristic. The ratio of male to female
employees is slightly less than three to one; this ratio varies little
across the three samples.


TABLE 1
Frequency Distribution of Religious Preference across Three Hospitals

Religious Preference    Hospital III    Hospital IV    Hospital V

Baptist                      21              65           191
Catholic                     42             107            12
Episcopalian                  1              13             8
Jewish                        1              17        [illegible]
Lutheran                      8              17             9
Methodist                    15              47            93
Presbyterian                 21              22            22
No Preference                33              47            12
Mormon                      230               1             1
Other                        18              64            44
N                           400             400           400


² The hospitals described herein are numbered III, IV and V in continuation
of, and in distinction from, the two hospitals previously reported in another
publication (Cohen & Struening, 1962).




In contrast to the above differences among the three samples, the 
samples are identical in occupational composition. The composition 
was determined by the average number of subjects in 16 occupa- 
tional categories in 12 V. A. mental hospitals. The 400 subjects were 
distributed as follows: Engineering and Maintenance, 48; House- 
keeping and Supply, 26; Cooks and Kitchen Workers, 50; Nursing 
Assistants, 133; Nurses, 33; Psychologists, 6; Social Workers, 4; 
Special Service, 8; Physical Medicine and Rehabilitation Service, 
19; Physicians and Dentists, 6; Psychiatrists, 5; Dietitians, 5; Ward 
Secretaries, 6; Clerk-typists, 36; Miscellaneous, 15. 

In each case the 400 subjects were selected by randomly drawing
the above specified number of subjects from each occupational
category of the obtained sample. The purpose of this stratagem was
to limit item variation and potential correlation due to differences 
in occupational composition. Thus variation in factorial structure 
is more likely to be a function of regional-cultural background and 
work situation differences rather than, for example, the proportion 
of professionally-trained respondents in the sample. The fact of 
marked differences among the means of occupational categories on 
these items has been demonstrated by Cohen and Struening (1962). 
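
The quota draw described above can be sketched in a few lines. This is only an illustration under stated assumptions: the quotas are the fifteen category counts listed in the text, while the function name, the respondent identifiers, and the fixed random seed are hypothetical and not part of the original study.

```python
import random

# Quotas per occupational category, copied from the list in the text.
QUOTAS = {
    "Engineering and Maintenance": 48,
    "Housekeeping and Supply": 26,
    "Cooks and Kitchen Workers": 50,
    "Nursing Assistants": 133,
    "Nurses": 33,
    "Psychologists": 6,
    "Social Workers": 4,
    "Special Service": 8,
    "Physical Medicine and Rehabilitation Service": 19,
    "Physicians and Dentists": 6,
    "Psychiatrists": 5,
    "Dietitians": 5,
    "Ward Secretaries": 6,
    "Clerk-typists": 36,
    "Miscellaneous": 15,
}


def draw_quota_sample(respondents_by_category, quotas=QUOTAS, seed=0):
    """Randomly draw quotas[c] respondents, without replacement, from each category c.

    respondents_by_category maps a category name to a list of respondent
    identifiers (hypothetical stand-ins for the obtained sample).
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    sample = []
    for category, quota in quotas.items():
        pool = respondents_by_category[category]
        if len(pool) < quota:
            raise ValueError(f"not enough respondents in {category!r}")
        sample.extend(rng.sample(pool, quota))
    return sample
```

Because every hospital sample is built from the same quotas, any difference in factor structure between hospitals cannot be attributed to a different occupational mix.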


Data Analysis. 


Each of the three hospitals is represented by 400 subjects selected 
according to the method described above. Their responses to the 
51 OMI items of mental illness content on a six point agree-dis- 
agree continuum are the data for three separate analyses. The set 
of 20,400 observations on each hospital was reduced to a 51 X 51 
product-moment correlation matrix. Five principal components fac- 
tors (Harman, 1960) were extracted from each of the three mat- 
rices and rotated to an orthogonal, simple structure solution by 
the varimax method (Kaiser, 1958). The five OMI factors were 
easily identified in the three solutions, with the exception of Hos- 
pital III where Factors A and D had to some extent merged. 
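
For readers who want to retrace this step, the following is a minimal sketch of principal components extraction followed by a varimax rotation. It assumes NumPy and uses the widely circulated SVD-based iteration for the raw varimax criterion; the article does not say whether Kaiser's row normalization was applied, so none is used here, and the function names are illustrative only.

```python
import numpy as np


def principal_component_loadings(corr, n_factors):
    """Loadings of the first n_factors principal components of a correlation matrix."""
    values, vectors = np.linalg.eigh(corr)            # eigenvalues in ascending order
    order = np.argsort(values)[::-1][:n_factors]      # indices of the largest eigenvalues
    return vectors[:, order] * np.sqrt(values[order])


def varimax(loadings, max_iter=200, tol=1e-8):
    """Orthogonal varimax rotation (raw criterion, no Kaiser normalization).

    Returns the rotated loadings and the orthogonal rotation matrix.
    """
    n_items, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        column_ss = np.diag((rotated ** 2).sum(axis=0))
        gradient = loadings.T @ (rotated ** 3 - rotated @ column_ss / n_items)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt                              # nearest orthogonal matrix
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):      # no further improvement
            break
        criterion = new_criterion
    return loadings @ rotation, rotation
```

Applied to a 51 × 51 correlation matrix with `n_factors=5`, the rotation leaves each item's communality unchanged and only redistributes variance among the five factors.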

To assess the degree of factorial invariance, the 15 × 15 matrix
of coefficients of proportionality (Burt, 1948) was computed on the
15 arrays (5 factors times 3 hospitals) of 51 factor loadings. This
set of 105 coefficients was reduced to a five by five matrix where the
main diagonal elements are the means of the three coefficients of
proportionality of all possible combinations of identical factors
(for example, A_III A_IV, A_III A_V, and A_IV A_V) and the other elements
are the means of the nine coefficients of proportionality resulting
when the three factor loading arrays of one factor are compared
with the three arrays of another factor (for example, A_III, A_IV and
A_V with B_III, B_IV, B_V for the AB coordinate).
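
The reduction just described can be sketched as follows. The article does not print the formula for the coefficient of proportionality; the sketch assumes the usual form, the sum of cross-products of two loading arrays divided by the square root of the product of their sums of squares (the quantity also known as the congruence coefficient). Function names and the input layout are illustrative assumptions.

```python
from itertools import combinations

import numpy as np


def proportionality(x, y):
    """Assumed form of the coefficient: sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))


def mean_proportionality_matrix(loadings_by_sample):
    """Reduce all pairwise coefficients to a factor-by-factor matrix of means.

    loadings_by_sample: one (n_items, n_factors) array per sample, with the
    factors in the same column order in every sample.
    """
    n_samples = len(loadings_by_sample)
    n_factors = loadings_by_sample[0].shape[1]
    result = np.zeros((n_factors, n_factors))
    for f in range(n_factors):
        for g in range(n_factors):
            if f == g:
                # same factor in different samples: 3 pairs when there are 3 samples
                pairs = list(combinations(range(n_samples), 2))
            else:
                # every array of factor f against every array of factor g: 9 pairs
                pairs = [(i, j) for i in range(n_samples) for j in range(n_samples)]
            result[f, g] = np.mean([
                proportionality(loadings_by_sample[i][:, f],
                                loadings_by_sample[j][:, g])
                for i, j in pairs
            ])
    return result
```

With five factors and three samples this uses 5 × 3 = 15 same-factor coefficients and 10 × 9 = 90 cross-factor coefficients, which accounts for the 105 coefficients mentioned in the text.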


Results and Discussion. 


The magnitude of the diagonal elements, representing similar 
factor comparisons, is clearly greater than the above-diagonal ele- 
ments, reflecting the comparison of dissimilar factors. Criteria for 
the level of congruence which one should require for similar factors 
have not been developed (Harman, 1960). In a study by Tucker 
(1951), coefficients of .93 or above were accepted. However, the 
variables in Tucker's study were tests or item composites rather
than single opinion items with known high errors of measurement. 
Considering, in addition, that sample variation was maximized, it 
seems reasonable to judge the diagonal coefficients adequate with 
the possible exception of Factor C, Mental Hygiene Ideology. The 
items selected to define this factor vary considerably in their inter- 
relationships across these samples, with the possible inference that 
the structure of the Mental Hygiene Ideology domain varies some- 
what as one moves across cultural and hospital situations or settings. 

Factors A and D possess a moderate degree of congruence. Since
these factors (Authoritarianism and Social Restrictiveness) are
viewed as analogous to Authoritarianism as measured by the F
scale (Adorno, et al., 1950; Cohen & Struening, 1962) and some
form of prejudice, a degree of congruence, both in terms of similar
factor loading patterns and factor score correlation, is expected. In
general there is a striking resemblance between the mean coefficients
of proportionality, in terms of sign and magnitude (Table 2, above
diagonal) and the correlations between factor scores (Table 2,
below diagonal, N = 1200, based on the three samples described above).
This is a reasonable result when one considers that one condition,
the intercorrelation of items within and between factor score
composites, may result both in moderately high coefficients of
proportionality and in moderate correlation of factor scores.


TABLE 2
Mean Coefficients of Proportionality and Factor Score Correlations*

Factor       A       B       C       D       E
A          787    -202    -100     418     214
B         -383     810     272    -383     039
C         -055     298     482    -210     150
D          619    -378    -210     825     140
E          200    -012     193     170     861

* Main diagonal and above diagonal elements: Mean coefficients of
proportionality of factor array pairs. Below diagonal elements:
Intercorrelations of factor scores.


Factor Scoring 
Procedure 


Once the factors in each of the three samples were identified and 
factorial similarity among hospitals was demonstrated, item selec- 
tion for the operational definition of each factor was done primarily 
by reference to each item’s factor loading pattern across the three 
samples. In general the assignment of items to factors was simplified 
by the fact that most of the items obviously “belonged” to a 
particular factor on the basis of high loadings on one factor and 
modest or low loadings on the other four factors. When the loading 
pattern was ambiguous (in 7 or 8 of the 51 items), a judgmental 
criterion of psychological meaningfulness or conceptual consistency 
was used. 


Results 


Using these standards, factors A, B, C, D, and E were defined
by 11, 14, 9, 10, and 7 items, respectively. Assigning values of 1 to 6
to the six point strongly agree-strongly disagree continuum, the
following formulas are used to score the OMI:³


Factor   Constant   Item Number
A  =  67  − Σ (1, 6, 9, 11, 16, 19, 21, 39, 43, 46, 48)
B  =  31  + Σ (26, 32, 34, 36, 37, 40, 49)
          − Σ (2, 12, 17, 18, 22, 27, 47)
C  =  48  + (31) − Σ (3, 13, 23, 28, 33, 38, 44, 50)
D  =  47  + Σ (8, 41) − Σ (4, 7, 14, 24, 29, 42, 45, 51)
E  =  43  − Σ (5, 10, 15, 20, 25, 30, 35)


For example, to strongly agree with all items of Factor E results in
a score of 43 − 7 = 36, indicating a strong endorsement of
Interpersonal Etiology.


³ Copies of the research form of the OMI giving the item content for the
following item numbers are available from Abacus Associates, Inc., 3280
Broadway, New York 27, N.Y.
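
A short sketch of the scoring rule may make the arithmetic concrete. The constants and item lists are transcribed from the scoring formulas; two entries that are hard to read in the scan, item 13 in Factor C and item 41 in Factor D, are assumptions, chosen so that each of the 51 items is used exactly once. The function name and the dictionary layout are illustrative only.

```python
# factor: (constant, items added, items subtracted)
SCORING = {
    "A": (67, (), (1, 6, 9, 11, 16, 19, 21, 39, 43, 46, 48)),
    "B": (31, (26, 32, 34, 36, 37, 40, 49), (2, 12, 17, 18, 22, 27, 47)),
    "C": (48, (31,), (3, 13, 23, 28, 33, 38, 44, 50)),         # item 13 assumed
    "D": (47, (8, 41), (4, 7, 14, 24, 29, 42, 45, 51)),        # item 41 assumed
    "E": (43, (), (5, 10, 15, 20, 25, 30, 35)),
}


def score_omi(responses):
    """Compute the five raw factor scores.

    responses maps item number (1..51) to a response coded
    1 (strongly agree) through 6 (strongly disagree).
    """
    scores = {}
    for factor, (constant, added, subtracted) in SCORING.items():
        scores[factor] = (
            constant
            + sum(responses[i] for i in added)
            - sum(responses[i] for i in subtracted)
        )
    return scores
```

A respondent who strongly agrees with every item (all responses coded 1) receives E = 43 − 7 = 36, matching the worked example in the text.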


Internal Consistency 


To make comparisons across populations using an item com- 
posite, the presence of factorial invariance is desirable as previously 
discussed. However it is possible, although unlikely, to have a high 
degree of factorial invariance without having a scale with high 
internal consistency. This could occur with similar factor loading 
patterns by a set of variables on a particular factor across samples, 
yielding high factorial invariance, but low internal consistency as a
function of the low magnitude of the factor loadings. This in turn 
implies low inter-item correlations within factor composites and 
large errors of measurement. 


Procedure 


Tryon’s (1957) covariance form of the general formulas for 
the reliability coefficient (equivalent to the generalized Kuder- 
Richardson formula 20 or Cronbach’s alpha) was used to estimate 
the internal consistency of the five OMI factors on the three samples 
of 400 subjects. This coefficient estimates the correlation between 
the total score of k items drawn randomly from a particular item
domain with the total score of another random set of k items. 
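
The coefficient can be computed directly from the item covariance matrix. The sketch below assumes NumPy and uses the standard covariance form of Cronbach's alpha, k/(k − 1) times one minus the ratio of the summed item variances to the variance of the total score; whether this matches Tryon's computing formula term for term is an assumption. Items that are subtracted in a factor score would need to be reflected before being passed in.

```python
import numpy as np


def cronbach_alpha(item_scores):
    """Cronbach's alpha from a (n_subjects, k_items) array of item scores.

    Uses the covariance form: the sum of all elements of the item
    covariance matrix equals the variance of the total score.
    """
    scores = np.asarray(item_scores, dtype=float)
    k = scores.shape[1]
    cov = np.cov(scores, rowvar=False)          # k x k item covariance matrix
    return k / (k - 1) * (1.0 - np.trace(cov) / cov.sum())
```

When the items are mutually uncorrelated the off-diagonal covariances vanish and alpha falls to zero, which is the situation the preceding section warns about for composites with uniformly low loadings.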


Results and Discussion 


The general level of the coefficients is adequate for group
comparisons with the possible exception of Factor C. The items selected
to define the Mental Hygiene domain have a generally low
magnitude of intercorrelation in the three samples as well as a varied
pattern of item-factor relationships across these samples. Thus the
factor is inconsistently defined and estimated with relatively large
errors of measurement. By inference, this domain is either loosely
organized within and inconsistently defined across samples, or the
item stimuli are of poor quality. Obviously this domain requires
additional conceptualization and empirical exploration. This is not
to say that the other four domains are as simple in structure as the
results indicate, even within the limited population selected for
study. It is highly probable that each of the five domains identified
is a multi-faceted phenomenon and would, with adequate
conceptualization and item writing, yield more than one meaningful and
relevant dimension. In addition, little thought is required to
hypothesize other aspects of the general mental illness domain not
brought to operational definition by this study. With the continued
improvement of electronic computers and additional information
from recent studies, this domain is ripe for more extensive
exploration which should yield greater conceptual richness and highly
reliable scales. The latter will pave the way for more comprehensive
research in assessing the relevance of these attitudes to interpersonal
transactions with those experiencing severe personality disorders
and to the constructive modification of these attitudes.


TABLE 3
Internal Consistency Coefficients of the Five OMI Factors on Three Samples

Factor    Hospital III    Hospital IV    Hospital V

A             801             803            708
B             725             716            698
C             353             392            289
D             712             765            706
E             658             658            653


Sten Transformation of Factor Scores 


In order to compare a subject’s (or group’s) relative standing 
across the five OMI factors, the subject’s (or group’s) five OMI raw 
factor scores may be transformed to sten scores by referring to 
Table 4. This sten system (Canfield, 1951) yields a standardized 
one digit score with a mean of 4.5 and a standard deviation of 2. 


TABLE 4
Transformation of OMI Factor Scores to Sten Scores

Sten       A        B        C        D        E

0         ≤6      ≤37      ≤18       ≤9       ≤8
1       7-10    38-41    19-21    10-13     9-10
2      11-14    42-44    22-23    14-17    11-13
3      15-18    45-47    24-26    18-20    14-15
4      19-23    48-50    27-28    21-24    16-18
5      24-27    51-53    29-31    25-27    19-21
6      28-31    54-57    32-34    28-31    22-23
7      32-36    58-60    35-36    32-35    24-26
8      37-40    61-63    37-39    36-38    27-28
9        41+      64+      40+      39+      29+


The standardization population is composed of 3,149 subjects having
frequent contact with patients in the twelve widely separated
Psychiatric Evaluation Project V. A. mental hospitals. Using V. A.
nomenclature, the following services, occupations, and professions
were included: nursing assistants, nurses, Physical Medicine and
Rehabilitation Service, Special Services, physicians and dentists,
psychiatrists, social workers, and psychologists. Samples of 50 per
cent of the personnel of each of the twelve hospitals were randomly
drawn from the obtained sample within each of the eight categories
listed above. The twelve samples were pooled to form the
standardization sample of 3,149 subjects.
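
The lookup in Table 4 amounts to finding the interval into which a raw score falls. A minimal sketch, assuming the lower bounds of stens 1 through 9 as read from the table (the row between stens 6 and 8 is taken to be sten 7); the names below are illustrative, not part of the published scoring materials.

```python
from bisect import bisect_right

# Lowest raw score that earns stens 1..9 on each factor, read from Table 4.
STEN_LOWER_BOUNDS = {
    "A": (7, 11, 15, 19, 24, 28, 32, 37, 41),
    "B": (38, 42, 45, 48, 51, 54, 58, 61, 64),
    "C": (19, 22, 24, 27, 29, 32, 35, 37, 40),
    "D": (10, 14, 18, 21, 25, 28, 32, 36, 39),
    "E": (9, 11, 14, 16, 19, 22, 24, 27, 29),
}


def to_sten(factor, raw_score):
    """Return the sten (0-9) for a raw OMI factor score."""
    # The number of lower bounds at or below the raw score is the sten itself.
    return bisect_right(STEN_LOWER_BOUNDS[factor], raw_score)
```

For instance, the raw Factor E score of 36 from the scoring example falls in the top interval (29 and above) and so converts to a sten of 9.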


Summary 


The factorial invariance of the five OMI factors was determined
on three samples of 400 subjects drawn from the personnel of three
V. A. mental hospitals. The samples were identical in
occupational-professional composition, but varied greatly in religious
and regional background. The samples represented three types of
hospitals with respect to patient-staff ratio, architectural style, and
size of patient population. Factor scoring formulas are presented for
the operational definition of the five factors along with the internal
consistency coefficients of the five item composites. A table for
transforming raw factor scores to sten scores, allowing cross-factor
comparisons, is provided.


REFERENCES 


Adorno, T. W., Frenkel-Brunswik, Else, Levinson, D. J., and
Sanford, R. N. The Authoritarian Personality. New York: Harper &
Brothers, 1950.

Burt, C. “The Factorial Study of Temperamental Traits.” British
Journal of Psychology, Statistical Section, I (1948), 178-203.

Canfield, A. A. “The ‘Sten’ Scale: A Modified C-scale.”
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, XI (1951), 295-297.

Cohen, J. and Struening, E. L. “Opinions about Mental Illness in
the Personnel of Two Large Mental Hospitals.” Journal of
Abnormal and Social Psychology, LXIV (1962), 349-360.

Harman, H. H. Modern Factor Analysis. Chicago: University of
Chicago Press, 1960.

Kaiser, H. F. “The Varimax Criterion for Analytic Rotation in
Factor Analysis.” Psychometrika, XXIII (1958), 187-200.

Neuhaus, J. O. and Wrigley, C. “The Quartimax Method: An
Analytical Approach to Orthogonal Simple Structure.” British
Journal of Statistical Psychology, VII (1954), 81-91.

Thurstone, L. L. Multiple-Factor Analysis. Chicago: University of
Chicago Press, 1947.

Tryon, R. C. “Reliability and Behavior Domain Validity.”
Psychological Bulletin, LIV (1957), 229-249.

Tucker, L. R. “A Method for Synthesis of Factor Analysis Studies.”
Personnel Research Section Report, No. 984. Washington, D. C.:
Department of the Army, 1951, 120.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


THE DEVELOPMENT OF CRITERIA OF STUDENT
ACHIEVEMENT¹


EDWIN A. LOCKE 
Cornell University 


An important but frequently neglected problem in the evaluation
of educational achievement has been the meaningfulness of the
criteria by which such achievement is judged. This is related to a
more general problem in psychology as a whole, namely, how do
we judge the validity or adequacy of the criteria which we use?
Hulin (1962) has shown that the use of two different criteria of
executive success would lead to quite different conclusions. Bass
(1962) and Ghiselli and Haire (1960) have found that the
validities of selection tests show marked fluctuations using repeated
measurements of the same criteria over time. Weitz (1961) has stressed
the importance of using multiple criteria even in experimental
laboratory studies; he showed that conclusions based on a single
criterion chosen a priori could be quite misleading.

In the area of educational achievement we find that grades are 
almost inevitably chosen as criteria of students’ performance by 
researchers and school officials alike. It can be granted that grades
have many desirable characteristics such as availability, quantifi- 
ability, at least a minimum degree of comparability from school 
to school, and finally a certain amount of stability over time. 

But grades also have undesirable characteristics as criteria.
They are immediate rather than intermediate or more nearly
ultimate criteria of lifetime achievement. Although studies by
Bridgeman (1930), Gifford (1928), Husband (1957) and Taylor
(1958) have found college success (in terms of grades, activities,
or sports) to be related to (financial) “success” in later life, more
detailed studies of this nature would be useful. In addition, grades
may often be merely reflections of an ability to memorize and
reproduce facts rather than an ability to use facts creatively. Finally,
grades ordinarily represent achievement in the structured
environment of the classroom and may tell us little about students’
behavior in less structured situations.


¹ This paper was completed in partial fulfillment of the degree of Master
of Arts, Cornell University. The writer is indebted to the late Dr. T. R.
Niel[remainder of surname illegible], Cornell University, director of the
Cornell NSF program in 1961, for making this study possible, and to his
Committee Chairman, Dr. Patricia C. Smith, for her help in all stages of
this research.

In view of the importance placed upon education in our cul- 
ture and the importance of scholastic achievement for individuals 
seeking jobs or admission to graduate school, it would seem wise 
to continually re-examine and re-evaluate the criteria which are 
used as indices of such achievement. 


Problem 


The purpose of the research was to compare eight separate 
indices of educational achievement which were taken from a variety 
of sources and which concerned several different kinds of behavior. 
This study was part of a larger testing program. The development 
of predictor scores and the prediction and postdiction of the criteria 
developed will be reported in a later paper. All eight criteria were 
in some way related to the intellectual attainments of the students. 


Subjects


The subjects were a group of 122 high school juniors and seniors 
(73 boys and 49 girls) attending a Cornell Summer School science 
training program under the auspices of the National Science Foun- 
dation (NSF). The students were highly selected in terms of both 
aptitude and motivation. The mean Preliminary Scholastic Aptitude 
Test scores were above the 97th percentile (using national norms 
by sex) for both Math and Verbal parts for both boys and girls. 
The average high school grade averages were 93 for both sexes. 
Eighty-one per cent of the subjects had done independent research 
work at high school and 43 per cent had won science fair prizes at 
the local, regional, or national level. 




Procedure 


The following eight criteria were obtained for each student: 

1. High School Average (HSA). All course grades shown on 
each student’s high school transcript were averaged to get an 
over-all high school average. No reliability estimates were made 
for these averages, but it is likely that they were highly reliable 
since they represented average achievement over a two or three- 
year period. 

2. Summary Application Blank Score (ABS). Each student was 
required to have two of his high school teachers fill out rating forms 
which were then sent to Cornell as part of his application to the 
NSF program. The forms asked for ratings (on a six-category 
scale) on six traits: scientific attitude, curiosity, inventiveness, 
initiative, work habits, and personal relations. The median trait 
intercorrelations computed after summing the ratings of the two 
teachers for each trait were .62 for the boys and .44 for the girls. 
(The ratings of different teachers on the same student were not 
compared since the teachers were from different courses and had 
different opportunities for observation. Also, in some cases the 
raters were not classroom teachers at all but high school guidance 
counselors.) In order to reduce the number of variables and to 
increase reliability, it was decided to sum the scores on the six 
traits to form a single summary score for each student. In view 
of the high trait intercorrelations, it was felt that little informa- 
tion would be lost as a result. The reliabilities of these summary 
scores, estimated from the average intercorrelations among the 
traits,² were .91 for the boys and .84 for the girls.

3. Cornell Grade (CGr). Each student took one of five courses
offered in the Cornell NSF program (Bacteriology, Chemistry,
Mathematics, Physics, or Zoology). The final course grades were
converted to stanines within each course to control for differences
in grading difficulty across courses. The reliabilities of the final
grades, estimated by correlating grades in the first and second
halves of the courses (or correlating lab grades with exam grades)
and correcting by the Spearman-Brown formula, ranged from .77
in Physics to .94 in Chemistry.


² To estimate the reliability of the summary scores, the formula for the
reliability of a total may be used:
$r_{tt} = \dfrac{a\,r_{11}}{1 + (a - 1)\,r_{11}}$
(Peters & Van Voorhis, 1940, Formula 113, p. 194) where $a$ = the number of
forms of the test, and $r_{11}$ is the average intercorrelation between the
$a$ forms. (This will be recognized as the familiar Spearman-Brown Formula.)

4. Cornell Summary Rating (CSR). Each course instructor was 
asked to rate each student on initiative, thoroughness, ability to ask 
good questions, and imaginativeness. The trait intercorrelations in 
the five courses ranged from .36 to .90 with the median at .70. 
Again, in order to reduce the number of variables and to increase 
reliability, it was decided to form a summary score for each student 
by summing the ratings on the four traits. In view of the high trait 
intercorrelations, it was felt that little information would be lost 
by this procedure. These summary scores were converted to stanines 
to control for level. The reliabilities of these summary scores, esti- 
mated from the average trait intercorrelations, ranged from .80 to 
.94 in the five courses (see footnote 2).

5. Content Analysis Score (CAS). After making ratings on the 
application blanks described in (2) above, the high school teachers 
were asked to give the evidence upon which each of their ratings 
was based. A content analysis of the comments given as evidence 
for these ratings was made, and seven categories were empirically 
derived. The categories were: attitude toward detail, curiosity, 
originality, initiative, persistence, reaction to failure, and foresight. 
The coder reliabilities for the originality and initiative categories 
were both .66 (or .81 corrected). Each individual was given a sum- 
mary score for each category (using the comments of both high 
school teachers together) and the scores were intercorrelated. The 
intercorrelations were low, due largely to the fact that the scores 
were probably unstable (since the comments were "free" responses, 
they could be influenced by numerous factors) and the fact that 
some of the categories were infrequently used. For the sake of re- 
liability, it was decided to sum the two categories which were most 
highly correlated. These were Originality and Initiative. The relia- 
bilities of the total scores, estimated from the correlations between 
these two categories, were .57 for the boys and .67 for the girls 
(see footnote 2). 

6. Productivity Index (P). The Productivity Index was the sum 
of the number of experiments, demonstrations, apparatus designs, 
and library research papers done by each student outside of required
class work and of more than three months' duration. This
information was obtained from a questionnaire given to each student.
No reliability estimate was made for this index.

7. Quality Index (Q). The Quality Index was the weighted sum
of the number of science fair prizes won by each student. National
prizes were given a weight of four, regional prizes a weight of three,
[illegible]-wide prizes a weight of two, and school prizes a weight of one.
First, second, third, and honorable mention prizes were given equal
weight within each category. This information was also obtained
from a questionnaire given to each student. No reliability estimate
was made for this index.
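The weighting scheme for the Quality Index amounts to a simple weighted count, sketched below. The weights are those stated in the text; the dictionary keys, in particular the label for the weight-two level, and the example student are hypothetical.

```python
# Weights stated in the text. The key for the weight-2 level is a
# placeholder because that word is not fully legible in the source.
PRIZE_WEIGHTS = {"national": 4, "regional": 3, "area_wide": 2, "school": 1}


def quality_index(prize_counts):
    """Weighted sum of science fair prizes. The place won within a
    level (first, second, third, honorable mention) is ignored."""
    return sum(PRIZE_WEIGHTS[level] * n for level, n in prize_counts.items())


# Hypothetical student: one regional prize and two school prizes.
print(quality_index({"regional": 1, "school": 2}))
```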

8. Self-Rated Initiative (Slf. In.). In a questionnaire each student
was asked to indicate, on a three-point scale, how often he
asked for extra homework problems, did extra reading, read scientific
journals, etc. The sum of the responses to the five questions
in this area was taken as a measure of initiative. The reliability
of this total score, estimated from the average intercorrelation of
the five questions, was .67 for the whole group (see footnote 2).

Since the matrices for the boys and girls seemed similar enough
to warrant the combining of the sexes, the eight measures were
intercorrelated using the whole group of 122 subjects. This yielded
about the same results as computing the group matrix by averaging
the within-course correlations for the five courses (Locke, 1962).
The data were also factor-analyzed by the Centroid method and
rotated by the Quartimax method (Neuhaus & Wrigley, 1954)
on a Burroughs 220 Computer.


Results 


... appears in the lower right hand corner.
The upper-left cluster contains both grade scores and both summary
rating scores; the lower-right cluster contains the Productivity
and Quality indices, Self-rated Initiative, and the Content Analysis
Score. The rotated factor loadings (see Table 2) show the same
grouping, with High School Average, Application Blank Score,
Cornell Grade, and Cornell Summary Rating loading most highly
on Factor I, and with Content Analysis Score, Productivity, Quality,
and Self-rated Initiative loading most highly on Factor
II. (A third factor, not shown in Table 2, contained the Application
Blank Score and Content Analysis Scores, but it was felt that
this may have been a partly spurious factor since the comments
were based on and directly relevant to the ratings.)


TABLE 1
Intercorrelations of Eight Criterion Variables
(n = 122)

                         HSA   ABS    CGr    CSR    CAS    P      Q      Slf. In.
High School Av.          --    .24**  .38**  .16    -.00   -.12   -.02   -.01
Applic. Blank Score            --     .21*   .25**  [?]    [?]    .08    -.10
Cornell Grade                         --     .62**  [?]    .13    -.03   .07
Cornell Summary Rating                       --     .12    .08    .05    .08
Content Analysis Score                              --     [?]    .22*   .07
Productivity Index                                         --     .34**  .30**
Quality Index                                                     --     .24**
Self-rated Initiative                                                    --

* p < .05
** p < .01
[?] Entry illegible in the source.

The first factor has been named “Structured Achievement” 
since it seems to contain variables which represent behavior in the 
structured environment of the classroom. Grades represent a re- 
sponse to a situation (e.g., homework, test, etc.) rather than be- 
havior initiated from within the individual. The teachers’ ratings 
probably represent their subjective evaluations of classroom and 


TABLE 2
Factor Loadings of Eight Criterion Variables
(after rotation)
(n = 122)

                                    Factor Loadings
Factor  Variables                   I      II     Name
I       High School Av.             .32    -.16   "Structured
        Applic. Blank Score         .28    -.03    Achievement"
        Cornell Grade               .84    .01
        Cornell Summary Rating      .75    .07
II      Content Analysis Score      -.11   [?]    "Self-Initiated
        Productivity Index          .10    .64     Achievement"
        Quality Index               -.04   .54
        Self-rated Initiative       .08    .50

[?] Entry illegible in the source.


test behavior. The second factor has been named “Self-Initiated 
Achievement” because it appears to represent behavior which is 
initiated by the individual and which is performed outside of re- 
quired school work. The reason the Content Analysis Score was 
slightly more closely related to Self-Initiated Achievement was 
that the high school teachers often used examples of independent 
achievement when giving evidence for their ratings, even though 
the ratings themselves were more closely related to the Structured 
Achievement factor. 
Discussion 

These results are quite similar to those found by Holland (1961) 
in a study of National Merit Scholar finalists. Holland developed 
a criterion of “creativity” by summing the number of original 
papers published, prizes won, apparatuses designed, patentable in- 
ventions made, and papers delivered at scientific meetings for each 
student. It can be seen that this is similar to the Productivity and 
Quality indices (and hence the Self-Initiated Achievement factor) 
used in this paper. The present writer would prefer, however, to
avoid the use of the term “creative” to describe Self-Initiated 
Achievement. The term implies that “creative ability” is involved 
in such behavior, and yet such environmental variables as encour- 
agement and competence of high school teachers, high school and 
library facilities, and parental guidance were not controlled. In 
view of this, the use of the term "creative behavior" to describe
Self-Initiated Achievement would seem to be premature. (The same 
criticism would hold for Holland’s use of the term.) Holland found, 
however, that his criterion of “creativity” was unrelated to grades, 
thus paralleling our finding that Structured and Self-Initiated 
Achievement were orthogonal. 

The real importance and meaning of the criterion factors de- 
scribed here can only be determined from follow-up studies of the 
same students over a long period of time. It can be hypothesized 
that those high on Structured Achievement and those high on Self- 
Initiated Achievement might tend to enter different professions or 
that within any given profession those high on Self-Initiated 
Achievement in high school would be more "creative" and independent
than those high on Structured Achievement.

The one conclusion that we can draw unequivocally, however, 




is that with one high level group of science students it was shown 
that two independent patterns of achievement could be distingu- 
ished, one involving standard grade achievement across two situa- 
tions and the other involving achievement outside of required school 
work. 

For this group at least, these results would lead one to question 
the adequacy of a single criterion chosen a priori as really representative
of these students' educational achievement. In view of the
similarity of these findings to those of Holland (1961) with another 
highly selected sample and similar criteria, some argument could 
be made for the generality of these findings. 

It might be fruitful to concentrate future research on: a) at- 
tempted replication of these findings with less highly selected groups 
of students, b) the development of other criteria of student achieve- 
ment in addition to those developed here, and c) the relationship of 
various immediate (high school) criteria to performance in later 
(college and postcollege) life. 


Summary 


Eight criteria of student achievement were developed and com- 
pared, using a select group of students taking advanced summer 
training in science at Cornell University. Two orthogonal criterion 
factors were isolated. The first involved high school and summer 
school grades and high school and summer school teacher ratings 
and was named “Structured Achievement.” The second involved 
indices of the amount and quality of independent work done at high 
school and was named “Self-Initiated Achievement.” It was sug- 
gested that the two (immediate) criterion factors might be differ- 
ently related to later (intermediate) criteria of achievement. Some 
problems for future research were delineated. 


REFERENCES 


Bass, B. M. "Further Evidence on the Dynamic Character of Criteria."
    Personnel Psychology, XV (1962), 93-97.
Bridgeman, D. S. "Success in College and Business." The Personnel
    Journal, IX (1930), 1-19.
Ghiselli, E. E. and Haire, M. "The Validation of Selection Tests in
    the Light of the Dynamic Character of Criteria." Personnel
    Psychology, XIII (1960), 225-231.
Gifford, W. S. "Does Business Want Scholars?" Harpers Magazine,
    CLVI (1928), 671-674.
Holland, J. L. "Creative and Academic Performance among Talented
    Adolescents." Journal of Educational Psychology, LII (1961),
    136-147.
Hulin, C. L. "The Measurement of Executive Success." Journal of
    Applied Psychology, XLVI (1962), 303-306.
Husband, R. W. "What Do College Grades Predict?" Fortune, LV
    (1957), 157-158.
Locke, E. A. "A Study of Criteria of Student Achievement." Paper
    presented at the 33rd Annual Meetings of the Eastern Psychological
    Association, Atlantic City, N. J., April 26-28, 1962.
Neuhaus, J. O. and Wrigley, C. "The Quartimax Rotation." British
    Journal of Statistical Psychology, VII (1954), 81-91.
Peters, C. C. and Van Voorhis, W. R. Statistical Procedures and
    Their Mathematical Bases. New York: McGraw Hill Book
    Company, 1940.
Taylor, C. D. "Some Variables Functioning in Productivity and
    Creativity." In C. W. Taylor (Editor), The Second (1957)
    University of Utah Conference on the Identification of Creative
    Scientific Talent. Salt Lake City, Utah: University of Utah Press,
    9-19.
Weitz, J. "Criteria for Criteria." American Psychologist, XVI
    (1961), 228-231.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


THE EFFECTIVENESS OF THE ANXIETY DIFFERENTIAL 
IN EXAMINATION STRESS SITUATIONS¹ ²


T. R. HUSEK 
University of California, Los Angeles 
AND 
SHELDON ALEXANDER 
Southern Illinois University 


The initial steps in the development of an instrument designed
to measure various types of situationally-aroused anxiety were
described in a recent report (Alexander & Husek, 1962). That paper
reported two experiments. In the first, six preliminary anxiety
scales were developed; in the second, the stimulus and testing
conditions were changed, and the effectiveness of the scales was
examined under the new conditions. The preliminary scales were
revised, and four Anxiety Differential (AD) item combinations were
recommended for future use. The report concluded that the AD
measures possessed sufficient reliability and validity to warrant fur- 
ther development and use. 

A number of limitations and cautions were also considered in the 
report. One of these involved the nature of the stimulus situations 
used. In both of the initial experiments anxiety was aroused in the 
subjects by having them look at pictorial descriptions of bodily 


¹ This research was performed at the University of Illinois. The authors
would like to express their appreciation to Charles E. Osgood for generously
making available the facilities of the Institute of Communications Research,
to Howard M. Bobren for his invaluable assistance in data processing, and to
Don Dulany and Raymond Frankmann for the use of their classes.

² A preliminary report of these results was presented at the American
Psychological Association meetings, Chicago, September, 1960.

The authors shared equal responsibility for the planning and conduct of
the research, and the preparation of this report.


injury. In the first experiment subjects watched a color movie of a
surgical operation; the subjects in the second experiment saw slides
about highway accidents. We deemed it essential to determine 
whether items that were successful in measuring anxiety aroused 
by films about bodily harm would be successful in measuring anxiety 
aroused by other types of stimulus situations. The test items were 
designed to measure cognitive changes, and it was possible that those 
changes might be influenced by changes in the nature of the anxiety 
stimulus. In more general terms: do most types of anxiety-inducing 
stimuli lead to similar cognitive changes, or do different anxiety 
situations produce different cognitive changes? 

A second potential limitation of the instrument concerns the in- 
tensity of the stimulus situation. The initial research had utilized 
rather strong anxiety-arousing stimuli. In addition, these stimuli 
were ones to which most of the subjects would rarely be exposed in 
their everyday lives. It seemed possible that the subjects would be 
less likely to have developed adequate defenses to deal with such 
situations than they would for other, more common anxiety-provoking
experiences. Would the items selected in the initial research
still be adequate when subjects faced a weaker stimulus situation 
or a situation to which they were more accustomed? Would the 
subjects be able more easily to hide or falsify their emotional re- 
sponses to such situations? 

The two studies reported here examined these problems by in- 
vestigating the sensitivity of the AD to pre-examination anxiety 
experienced in a college setting. 


Procedure 
Study 1 


The anxiety subjects were 112 sophomore male students in an engi- 
neering mechanics course at the University of Illinois. They re- 
sponded to a 40-item version of the Anxiety Differential. The AD 
test booklets contained almost all of the items selected in the initial 
investigations as anxiety indicators, plus several new items viewed 
as potential anxiety indicators. As before, there were several filler
items that were likely to confuse the subjects as to the purpose of the 
test. The items were randomly assigned to their positions in the test 
booklet. The subjects responded to the AD a few minutes before 
taking the final quiz in the course. The quiz counted ten per cent 




toward the final course grade. Subjects took between five and ten 
minutes to complete the AD. 

The control subjects were 55 males in two college mathematics 
classes. They responded to the AD booklet during a regular class 
meeting. 


Study 2 


This study was performed after the major portions of the data 
from Study 1 had been analyzed. The anxiety subjects were 126 
males and 111 females in an introductory psychology class at the 
University of Illinois. They responded to a 31-item Anxiety Differ- 
ential just prior to taking the final examination in the course. The 
control subjects were 110 males and 70 females in another introductory
psychology class. They responded to the AD in a regular
class session, about one month before the testing of the pre-examina- 
tion group. After completing the AD booklet, both groups of subjects 
were also asked to write down what they thought was the purpose 
of the test. 


Results 


To determine whether the AD measures had differentiated between
the pre-examination and control groups, means and standard
deviations were computed for each of the item combinations utilized
in previous studies (Alexander & Husek, 1962). The results for the
four tests are presented in Table 1.³ The pre-examination and control
groups were then compared by means of t tests, the results of
which are also presented in Table 1. While all item combinations were
examined, major attention was focused on tests 1-4 since these were
revisions of scales 1-6 and were based on the data from both of the
earlier experiments. The four tests⁴ significantly differentiated between
the anxiety and control groups in both Study 1 and Study 2
(except for test 4, which was not used in Study 1). Actually, the
results for the six "preliminary scales" were quite similar to those
for the revised tests. In Study 1, all but scale 4 yielded significant
differences. In Study 2, all but scale 4 showed significant differences
for the males, and all but scales 4 and 6 for the females. The relative
weakness of scale 4 has been dealt with elsewhere (Alexander &
Husek, 1962).


³ A document listing a) all the Anxiety Differential items used in the two
studies, b) means [illegible] for each group of subjects for each AD item
used in the two studies, c) t tests for the various item comparisons, and d)
means, standard deviations, and t tests for the six "preliminary scales" has
been deposited with the American Documentation Institute.
Order Document No. [?]461, remitting $1.25 for 35 mm. microfilm or $1.25 for
photocopies to Chief, Photoduplication Service, ADI, Library of Congress,
Washington 25, D. C.

⁴ It should be noted here, as it was in our first report on the AD, that there
is considerable overlap among the four tests. The extent of the overlap is
shown in that first report (Alexander & Husek, 1962).


TABLE 1
Means and Standard Deviations for Each of the Anxiety Measures Along with
t Values for the Comparison between Groups

Columns: Item Combination; Study 1 (Anxiety M, SD; Control M, SD);
Study 2 Males (Anxiety M, SD; Control M, SD); Study 2 Females (Anxiety
M, SD; Control M, SD).
Rows: Test 1; Test 2; Test 3; Test 4, All-purpose (a); each followed by a
row of t values.
[The table was printed sideways; its numerical entries are not legible in
the source.]

Note. Because of the particular scoring system used, the lower the absolute
numerical score, the higher the subject's anxiety.
(a) This item combination was not used in this study.
* t values significant beyond the .02 point.
** t values significant beyond the .01 point.

The internal consistency of the various item combinations was 
examined by using Alpha Coefficients (Cronbach, 1951), computed 
for the anxiety groups only. The Alphas are presented in Table 2. 
Those for the four revised tests range between .58 and .80 with a 
median coefficient of .68. 
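For readers who wish to reproduce this kind of internal-consistency check, coefficient alpha (Cronbach, 1951) can be sketched as below. The function name and the item responses are hypothetical; the responses are not data from these studies.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Coefficient alpha for a list of rows, one row of item scores
    per respondent: k/(k-1) * (1 - sum of item variances / variance
    of the total score)."""
    k = len(item_scores[0])
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)


# Hypothetical seven-point responses of five subjects to three items.
responses = [[5, 6, 5], [3, 3, 4], [6, 7, 6], [2, 3, 2], [4, 4, 5]]
print(round(cronbach_alpha(responses), 2))
```

The same variance definition must be used for the items and for the total; since only their ratio enters the formula, population and sample variances give the same coefficient.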


TABLE 2
Alpha Coefficients for the Various Item Combinations

                             Study 1    Study 2
Item Combination                        Males    Females
Test 1                       .7[?]      .68      .80
Test 2                       .66        .60      .69
Test 3                       .66        .58      .78
Test 4 All-purpose Test      *          .66      .78

* Not utilized in this study.
[?] Digit illegible in the source.


There were several problems that could be examined only in Study 
2 since no data bearing on these matters had been collected in Study 
1. The two most important questions concerned (a) the subjects’ 
ability to guess the purpose of the AD, and (b) sex differences in 
responding to the AD.

The subjects in Study 2 had been asked to write down what they 
thought the questionnaire was trying to get at. Any responses re- 
lated to anxiety, fear, tension, nervousness, insecurity, being upset, 
etc., were classified as having correctly stated the aim of the AD.
Responses that had no relation to anxiety were scored as incorrect. 
Eight per cent of the control group (14 of 180 subjects) and 32 per
cent of the pre-examination group (77 of 237 subjects) were able to
guess the purpose of the AD. A chi-square analysis indicated that
the difference between the control and anxiety subjects was
statistically significant (χ² = 36.67, p < .001). This finding led to the
question of the relationship between the subject's ability to state


the purpose of the test and his score on the instrument. Would 
subjects who could correctly state the purpose of the test obtain 
different anxiety scores from subjects who could not guess what the
test was intended to measure? This problem was investigated by 
means of point-biserial correlations. The anxiety scores were cor- 
related with whether or not the subject had guessed the purpose of 
the AD. These correlations, computed separately for males and fe- 
males, ranged between —.11 and +.08 and were not significantly 
different from zero correlations. 
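The two analyses in this paragraph can be sketched as follows. The counts (14 of 180 and 77 of 237) are those given in the text; the function names and the six anxiety scores are hypothetical. Whether the authors applied a continuity correction is not stated, so the uncorrected Pearson chi-square is assumed; with these counts it comes out near 36.6, close to the reported 36.67.

```python
from math import sqrt
from statistics import mean, pstdev


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the
    fourfold table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def point_biserial(flags, scores):
    """Correlation between a 0/1 variable and a continuous score."""
    ones = [s for f, s in zip(flags, scores) if f == 1]
    zeros = [s for f, s in zip(flags, scores) if f == 0]
    p = len(ones) / len(scores)
    return (mean(ones) - mean(zeros)) / pstdev(scores) * sqrt(p * (1 - p))


# Counts from the text: 14 of 180 control subjects and 77 of 237
# pre-examination subjects stated the purpose correctly.
chi2 = chi_square_2x2(14, 180 - 14, 77, 237 - 77)
print(round(chi2, 2))

# Hypothetical anxiety scores for six subjects (1 = guessed the purpose).
r_pb = point_biserial([1, 0, 1, 0, 0, 1], [80, 82, 79, 85, 78, 84])
print(round(r_pb, 2))
```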

Since previous results had suggested that the AD might be less 
sensitive for female subjects (Alexander & Husek, 1962), both male 
and female subjects were tested in Study 2. While the AD differ- 
entiated significantly between female pre-examination and control 
groups (Table 1), the obtained differences were smaller for females 
than for males. Therefore, the mean anxiety scores for males and 
females were compared for the anxiety group and for the control 
group by means of t tests. The obtained t values ranged from -.19
to +1.20, and thus in no case was a significant t found. 
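A comparison of two group means of this kind can be sketched from summary statistics alone. The paper does not say which form of the t test was used, so a pooled-variance test for independent groups is assumed; the function name, means, and standard deviations below are hypothetical, and only the group sizes (126 males and 111 females in the Study 2 anxiety group) come from the text.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-groups t statistic with a pooled variance estimate.
    sd1 and sd2 are taken to be sample (n - 1) standard deviations.
    Returns the t value and its degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se, df


# Hypothetical means and standard deviations for males and females.
t_value, df = pooled_t(80.0, 10.0, 126, 81.0, 9.0, 111)
print(round(t_value, 2), df)
```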

The results for individual items were then examined for sex 
differences. Comparisons between males and females were made for 
each item. These t tests were performed both in the pre-examination
group and in the control group. Of the 31 items in the AD booklet, 
there were statistically significant differences (p<.05) between men 
and women for seven items in the control group and for eight items 
in the anxiety group. Six of the seven items yielding significant sex 
differences in the control groups also yielded significant differences 
in the same direction in the anxiety groups. Each of the tests dis- 
cussed earlier (tests 1-4) contained at least two of the items yield- 
ing sex differences, but none of the tests contained more than four 
of these items. 

As in all previous studies of the AD, each individual item in the 
test booklet was re-examined to determine its sensitivity to the 
anxiety situation being studied. The item data for the three groups 
of subjects (Study 1 males, Study 2 males, Study 2 females) were 
examined. For each item, t tests were computed between the
anxiety and control subjects. Eighteen items that possessed adequate 
strength and were consistent in all three groups were then selected 
to constitute a measure of examination anxiety. These items are
presented in Table 3. This "examination anxiety" test is a refinement
and extension of the item combinations developed in previous
studies of other types of anxiety situations. Of the 18 items listed
in Table 3, twelve were already on tests 1-4. Four had been used in
previous research. The two remaining items had not been used in
previous research, but they were added to the test booklet in the
expectation that they might prove to be good items.


TABLE 3 
Items Selected to Constitute New AD Measure of Examination Anxiety 


FINGERS: straight—twisted* 

ME: helpless—secure 

BREATHING: tight—loose 

SCREW: strong—weak

HANDS: wet—dry 

TODAY: loose—tight 

ME: frightened—fearless 

GERMS: deep—shallow 

HANDS: good—bad 

BREATHING: careful—carefree 

FINGERS: stiff—relaxed 

ME: calm—jittery 

HANDS: tight—loose* 

BREATHING: hot—cold* 

SCREW: loose—tight 

ME: carefree—worried 

ANXIETY: clear—hazy
FINGERS: loose—tight* 

Note. The italicized adjective indicates the anxiety side of the scale. 

* These items should [illegible] with female subjects. They
[illegible] or if only male subjects [text breaks off in the source]


Discussion 

Anxiety Differential and Measurement of Examination Anxiety

The results of Studies 1 and 2 indicate that the item combinations
developed to measure bodily harm anxiety were also reasonably
sensitive to pre-examination anxiety. The tests were able
to differentiate significantly between the anxiety and control
groups.⁵ The internal consistency of the measures was also examined,
and the obtained coefficients are viewed as adequate for an
instrument at this stage of development. However, future work
should definitely include attempts to improve the reliability of the
instrument.


⁵ In the time since the studies reported here were analyzed, two studies have
been performed using the new AD measure of examination anxiety as an index
of anxiety. In both of these studies the measure discriminated between
examination stress and non-examination conditions. One of the studies is in
print (Wittrock & Husek, 1962); the other has just passed the data analysis

One of the primary questions that led to these two studies con- 
cerned the generality (or specificity) of the AD items. Were the 
cognitive changes induced by the arousal of bodily harm anxiety 
the same as those induced by examination anxiety? The data indi- 
cated that there was a sizable amount of commonality. On the other 
hand, there was also evidence that some of the cognitive responses 
were not the same. A few of the items behaved quite differently from 
the earlier studies of bodily harm anxiety. One item, “GERMS: 
deep—shallow,” yielded results in both Studies 1 and 2 which were 
opposite to the results obtained in the earlier research on bodily 
harm anxiety; that is, the anxiety-control group differences were in 
the reverse direction. There were four other items that had shown 
large and consistent experimental vs. control differences in the 
studies of bodily harm anxiety, but which yielded insignificant 
(MOVIES: loose—tight) or inconsistent (DREAMS: loose—tight, 
MY MIND: loose—tight, SCREW: nice—awful) changes in the 
present studies. In addition, four items in the suggested new measure 
of examination anxiety (TODAY: loose—tight, HANDS: tight— 
loose, ME: carefree—worried, and ANXIETY: clear—hazy) were 
used in our bodily harm research but did not differentiate well 
enough at that time to be included in the tests based on that re- 
search.

Thus, while the majority of items appear to be cross-situational, 
there are a number of items that clearly are not. This is a matter on
which attention must be focused in future research utilizing the AD. 
In addition, the fact that the majority of items on the AD tests are 
sensitive to bodily harm and examination anxiety does not reduce 
the need for investigating responses to other types of anxiety (e.g.,
moral anxiety, rejection anxiety, aggression anxiety). 


structed on the basis of only one testing in each group. The results of our 
earlier research suggested that although the AD was sensitive as a one-time 
measure, it was more sensitive to induced anxiety states if pre-post changes 
were measured. It is the authors’ opinion that, wherever possible, both before 
and after measures should be obtained. Users should also include some filler 
items with the anxiety items. We have normally used 10-15 such filler items
(which can be found in the ADI document associated with our earlier paper). 




Sex Differences 


Since an earlier experiment had indicated the possibility of sex 
differences, this question was examined again. While there were no 
significant differences between males’ and females’ test scores, the 
differences were in the same direction as in the earlier research 
(females having obtained lower scores than males). In addition, 
there were significant differences between males and females on 
several of the individual items. Sex differences were found for ap- 
proximately one-quarter of the items. Fortunately, most of these 
were consistent in both the anxiety and control groups so that 
anxiety-control comparisons were not substantially influenced. How- 
ever, these findings support previous indications that sex and 
anxiety scores may often interact, and the authors would reinforce
their earlier recommendation that in experiments involving anxiety,
male and female responses should be analyzed separately.


Response Distortions 


In the use of any self-report instrument, the question of distortion 
of responses (response sets, “faking,” etc.) is a crucial one. The 
results of Study 2 yield some evidence related to this matter. First,
it appears that a majority of the subjects did not know what the
instrument was attempting to measure.⁷ This would, therefore, make it
more difficult for the subjects consistently to distort their responses.
Second, it was found that there was no relation between anxiety 
scores and knowing what the instrument was attempting to measure. 
This suggests that conscious faking (trying not to give anxiety re- 
sponses) did not occur. The procedure we used did not, of course, get 
at all possible forms of response distortion, but the results do suggest
that the Anxiety Differential is not readily susceptible to conscious
attempts at consistent response distortion.


⁷ This was true even though there were a number of factors working in
favor of the subjects' guessing the AD's purpose. These subjects
were not "naive." They had just completed a psychology course, and, in addi-


Summary 


Two studies investigated the applicability of the Anxiety Differential
to situations involving pre-examination anxiety. The data
from both studies indicated that the Anxiety Differential was sufficiently
reliable and valid to be used in measuring examination
anxiety. A new combination of items, designed specifically for
examination anxiety, was developed.

The results of the second study indicated that responses to the
Anxiety Differential were not consistently falsified by subjects.
Most subjects did not guess the purpose of the test; and, even where
they did, there was no effect on subjects' anxiety scores. Sex differences
in responding to the AD also were noted. While these were not
of major dimensions, it was recommended that data from males and
females be analyzed separately in anxiety studies.


REFERENCES 


Alexander, S. and Husek, T. R. "The Anxiety Differential: Initial
    Steps in the Development of Measures of Situational Anxiety."
    EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, XXII (1962).
Cronbach, L. J. "Coefficient Alpha and the Internal Structure of
    Tests." Psychometrika, XVI (1951), 297-334.
Wittrock, M. C. and Husek, T. R. "Effect of Anxiety upon Retention
    of Verbal Learning." Psychological Reports, 10, 78. Southern
    Universities Press, 1962.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


THE GRADING BEHAVIOR OF A COLLEGE FACULTY


LEWIS R. AIKEN, JR. 
Woman's College of the University of North Carolina 


Much has been written about errors in rating individuals. Terms
such as "central tendency, frame of reference, and adaptation
level" (Hollingworth, 1910; Helson, 1948) are used to label the
tendency to make judgments in relation to some "anchor," rather
than absolutely. 

One of the most popular rating devices is the five-point scale,
consisting of "A, B, C, D, and F," or some simple variation on this,
which is used periodically by thousands of raters called teachers
to judge the behavior of millions of ratees called pupils or students.
It is perhaps unfortunate, but true, that for many students grades
are the important thing which one gets from school. Thus, although
attempts are made to reassure students about grades, they still
spend quite a lot of time worrying about them. Students talk about
the grading procedures of various teachers, compare grades, and
plead for grades. They realize that important decisions affecting
their lives hinge on grade-point averages. Consequently, it appears
potentially worthwhile to examine the methods currently used in
grade assignment.

It is granted that more information usually goes into the making
of judgments in the case of grades than in the usual check list of
characteristics. Teachers have many opportunities to observe and
evaluate the performances of their charges. However, often no
absolute standards as to what constitutes an A, B, C, D, or F exist in
their heads. When experienced teachers are questioned about their
grading procedures, they may either refer vaguely to the "performance
of past classes" or candidly admit that they "curve the grades."




It is the purpose of this paper to show that, whatever teachers may
say, they usually grade with reference to the existing ability level of
their students, i.e., intuitively or statistically, they "curve" their
grades. Although this may sometimes be the fairest procedure, the
meaning and interpretation of such grades, when the ability level
changes annually, may be a problem of some concern.


The Frame of Reference in Grading: An Example 


Webb (1959) has discussed the "freezing" of grades at one col- 
lege. Selection of students of greater ability was not followed by 
higher grades. Presently, the same situation exists at the Woman's 
College of the University of North Carolina, and probably else- 
where as well. 

In 1959, the Woman's College began selecting students on the
basis of a multiple regression equation, which consisted of assigning
numerical weights, based on the 1958 freshman class, to three
predictor variables: Scholastic Aptitude Tests—Verbal (SAT-V),
Scholastic Aptitude Tests—Mathematical (SAT-M), and a converted
two-digit score of rank in high school graduating class
(HSR). This equation, Predicted Grade (PG) = .037 SAT-V +
.010 SAT-M + .328 HSR - 21.98, yielded an R of .70 and was
used to predict freshman average grades. Due to the progressively
decreasing selection ratio for the years 1959, 1960, and 1961, the
mean scores on SAT-V, SAT-M, and HSR for students who were
admitted progressively increased, but increases in the means of the
predictor variables were not accompanied by an increase in the
criterion mean. Table 1 depicts this situation quite clearly. It shows
that although mean SAT-V, SAT-M, and HSR and, consequently,
PG, rose steadily from 1959 to 1961, the actual freshman average
grade (FAG) mean did not follow suit.
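The prediction equation can be checked against Table 1 with a short sketch. It reads the printed weights as .037, .010, and .328 with an intercept of -21.98. That reading is an assumption about the garbled print, but it reproduces the tabled PG means to two decimals. The function and variable names are illustrative only.

```python
# Weights as read from the text (fitted on the 1958 freshman class).
W_SAT_V = 0.037
W_SAT_M = 0.010
W_HSR = 0.328
INTERCEPT = -21.98


def predicted_grade(sat_v: float, sat_m: float, hsr: float) -> float:
    """Predicted freshman average grade (PG) from the three predictors."""
    return W_SAT_V * sat_v + W_SAT_M * sat_m + W_HSR * hsr + INTERCEPT


# Class means of SAT-V, SAT-M, and HSR as reported in Table 1.
class_means = {
    1959: (453.90, 453.94, 60.61),
    1960: (480.71, 469.89, 61.37),
    1961: (492.66, 485.26, 62.60),
}

# Because the equation is linear, PG evaluated at the predictor means
# equals the mean PG for the class.
pg_at_means = {year: round(predicted_grade(*m), 2) for year, m in class_means.items()}
print(pg_at_means)
```

The values obtained this way rise from year to year, in line with the PG row of Table 1, while the tabled FAG means stay near 19.5.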

Table 2 demonstrates that all increases in the means of the 
predictor variables were statistically significant, by Student t-tests, 
but the criterion mean was relatively stable. 
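As an illustration of the kind of test involved, the sketch below recomputes a Student t from the summary statistics in Table 1. It assumes a pooled-variance test for independent groups, which is consistent with the degrees of freedom shown for Table 2 (738 + 894 - 2 = 1630). The resulting values are a recomputation, not the author's tabled figures, and the function name is hypothetical.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student t for two independent groups using a pooled variance estimate.

    Returns (t, degrees of freedom); t is positive when the second mean is larger.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean2 - mean1) / se, df


# SAT-V, 1959 versus 1960 (means, standard deviations, and N from Table 1).
t_sat_v, df_sat_v = pooled_t(453.90, 87.27, 738, 480.71, 83.05, 894)

# Freshman average grade (FAG), 1959 versus 1960.
t_fag, df_fag = pooled_t(19.45, 6.90, 738, 19.49, 6.90, 894)

print(df_sat_v, round(t_sat_v, 2), round(t_fag, 2))
```

Under these assumptions the SAT-V difference lies far beyond the .01 cutoff, while the FAG difference is close to zero, which is the pattern the text describes.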

These results demonstrate quite clearly that the grading behavior
of this faculty was not based on standards that held constant over
the years. The grading standards shifted with the ability level of
the class, being more stringent in each successive year. Thus, the
faculty as a whole had no implicit standards of good and poor
performance. The "standard" was dictated by the quality of students
in the current freshman class.


TABLE 1

Means and Standard Deviations of Scholastic Aptitude Tests—Verbal (SAT-V) and
Mathematical (SAT-M), Rank in High School Graduating Class (HSR),
Predicted Grade (PG), and Freshman Year Average Grade (FAG) with
Multiple Correlations (R) for Three Years

                                      Year

          1959 (N = 738)         1960 (N = 894)         1961 (N = 953)

                  Standard               Standard               Standard
          Mean    Deviation      Mean    Deviation      Mean    Deviation

SAT-V    453.90     87.27       480.71     83.05       492.66     77.16
SAT-M    453.94     71.92       469.89     76.26       485.26     74.59
HSR       60.61      7.09        61.37      6.54        62.60      6.30
PG        19.23      5.29        20.63      5.39        21.63       .92
FAG       19.45      6.90        19.49      6.90        19.68


Conclusions 


With perhaps some individual exceptions, faculty members' notion
of "the average student," who receives a grade of "C," is that
of a person average in his group. Of course, it may be pointed out
that a "C" means different things at different colleges so it is not
inappropriate that its meaning changes annually within a given
college. However, one problem which this cavalier acceptance of
changing standards does not solve is how one goes about explaining
it to outsiders, e.g., parents and prospective employers. Many of
these people, not being cognizant of such shifts or the reasons for
them, continue to think of grades as something firm or fixed and
judge or select students accordingly. It will be noted from Table 2
that the mean rank in high school graduating class increased
significantly from year to year, and so there are no presentiments to
parents at this level. If these shifts in grading standards, which
make college more difficult in each succeeding year for students
having the same abilities, are to continue, then such information
should be made available to outsiders as well as to those within
the teaching profession itself.


TABLE 2

Student "t" Tests of Significance of the Differences Between Means for
Predictor and Criterion Variables for Three Years

                               Comparison

     1959 with 1960 (df = 1630)          1960 with 1961 (df = 1845)

[Table entries illegible.]


REFERENCES 


Helson, H. "Adaptation Level as a Basis for a Quantitative Theory
of Frames of Reference." Psychological Review, LV (1948), 297-313.

Hollingworth, H. L. "The Central Tendency in Judgment." Journal
of Philosophy, Psychology, and Scientific Methods, VII (1910).

Webb, S. C. "Measured Changes in College Grading Standards."
College Board Review, No. 39 (1959), 27-30.




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963


THE USE OF TEACHING MACHINES IN DEVELOPING 
AN ALTERNATIVE TO THE CONCEPT 
OF INTELLIGENCE 


A. GARTH SORENSON 
University of California, Los Angeles 


THERE are good reasons why school psychologists and teachers
might well abandon the I.Q., especially since it appears that a better
concept can be developed. These reasons are in part technical,
having to do with ambiguities in definition and with untenable
assumptions which have been made regarding such matters as the
contribution of heredity, the constancy of the I.Q., the role of culture
in defining intelligent behavior, and the function of non-intellective
components. The technical and theoretical limitations of the
concept of I.Q. have been well documented (Anastasi, 1950; Cronbach
& Meehl, 1955; Eells, et al., 1951; Guilford, 1956; Liverant,
1960; Spiker & McCandless, 1954).

Other reasons for abandoning the concept of I.Q. have to do with
its practical consequences. Sectioning and grades are only too often
influenced by teachers having consciously or unconsciously assigned
their students to rigid intellectual categories on the basis of test
scores. Observant psychologists will be aware that most college
professors, teachers, parents, pupils, and indeed many school
counselors still regard intelligence tests as measures of some innate,
constant and general, if illusive, intellectual capacities, the limits
of which are determined by heredity. These practices and beliefs
persist in spite of conceptual advances in both psychology and
genetics and in spite of the evidence which is constantly being
supplied by sophisticated test users—evidence which should lead to
[illegible] of the folklore about intelligence tests. The fact that
professional test makers have gradually replaced the term
"intelligence" with such labels as "scholastic ability" or "academic
aptitude" has apparently escaped the notice of most consumers of tests.
The significance of the fact that modern I.Q.'s are in fact standard
scores, and not the result of dividing mental age by chronological
age and multiplying by 100, is not understood by most of those who
interpret test scores. Because the term "intelligence," as Spiker
and McCandless (1954) have pointed out, was taken by psychologists
from the natural language, it has always had too many meanings,
all of which are too vague. Concepts do have consequences,
for people must act on their beliefs. The erroneous beliefs about the
limitations of the human organism which cluster about the popular
concept of intelligence can do and have done a good deal of damage
to the self-concept of many individual students. These erroneous
beliefs can and have impaired the effectiveness of many teachers
who, because they assume that a given test score proves that some
of their students are unable to learn, give up trying to teach those
students. It seems clear that teachers and counselors should
eliminate the terms "intelligence" and "I.Q." from their vocabulary.

Teachers and counselors will continue to need more accurate and
comprehensive ways of estimating a pupil's ability to learn. Changes
in our society resulting from technological advances, from the
development of new occupations, and from the explosion of knowledge,
make it ever more necessary to be able to predict in advance how a
particular student is likely to perform under a variety of
circumstances. If we are to give up our faulty tools, we will have to look
for some that are more effective. The best available standardized
aptitude and achievement tests are helpful but offer only a partial
solution. We must develop better appraisal procedures. I would
suggest that one of the movements to which counselors should give
considerably more attention is the continuing advance of programmed
instruction—the so-called teaching machines. Those who
are interested in programmed instruction are quick to admit that
the instructional materials are more important than the "hardware"
—the method of presentation. I would like to suggest that the
machines themselves offer intriguing opportunities for developing a
concept potentially more useful than the concept of intelligence. It
is the purpose of this paper to suggest some of the possibilities which
are to be exploited if only we are willing to devote the necessary
time and effort.




Learning Ability Reconsidered 


One common sense definition of intelligence is "ability to learn
from experience." This definition has seldom been used by professional
psychologists because it is too broad, and because in the past
it did not lend itself readily to operational definition. However, it
may be that the time has come to reconsider and to develop the
concept of ability to learn, or specifically the concept of learning
rate. At least in school subjects it may be possible, because of recent
developments in programmed teaching and "teaching machines," to
assess rate of learning in a way which will be more meaningful and
applicable than ever before.

In one version of programmed teaching, the branching programs,
the material to be learned is placed on microfilm and shown on
the screen of the teaching machine in small, logical units. After a
student has studied a unit, he presses a button and a multiple-choice
question appears on the screen. The student then presses
one of five buttons, numbered to correspond with the alternatives
on the test question, to indicate which of the answers he believes
to be correct. Thus he is tested immediately after studying each
unit. The test result is used automatically to control the material
he will see next. If the student answers the test question correctly,
he is so informed by the machine's flashing a green light, and he is
automatically given the next unit of information and then the next
question. If he fails the test question, the preceding unit of
information is reviewed, the nature of his error is explained to him, and
he is retested, a separate set of correctional materials being provided
for each of the incorrect alternatives. Some of the machines
can be made to print on a paper tape a complete record of each
student's sequence of choices and the amount of time spent, or
number of attempts made on each problem. As Crowder (1960)
has stressed, the essential feature of this type of machine as an
[illegible] device is the "feedback control" which determines
whether the student has learned a given unit, and which then takes
appropriate action, presenting either new or corrective material on
the basis of that determination.
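The feedback control described above can be sketched in a few lines. This is a minimal illustration and not a description of any actual machine. The class names, the `answer_fn` callback, and the `max_attempts` guard are hypothetical, and time spent is left out so that the example stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """One unit of material with its multiple-choice test question."""
    text: str
    question: str
    correct: int        # number of the correct button (1-5)
    corrections: dict   # wrong button -> corrective explanation


@dataclass
class Record:
    """Stand-in for the paper tape: every choice, in order."""
    choices: list = field(default_factory=list)            # (unit index, button)
    corrections_shown: list = field(default_factory=list)  # corrective messages

    def attempts(self, unit_index: int) -> int:
        """Number of tries the student needed on one unit."""
        return sum(1 for i, _ in self.choices if i == unit_index)


def run_program(units, answer_fn, max_attempts=10):
    """Present units in order. A correct answer advances to the next unit;
    a wrong answer shows the correction for that alternative and retests."""
    record = Record()
    for i, unit in enumerate(units):
        for _ in range(max_attempts):
            choice = answer_fn(i, unit)
            record.choices.append((i, choice))
            if choice == unit.correct:
                break  # "green light": move on to the next unit
            message = unit.corrections.get(choice, "Review the unit.")
            record.corrections_shown.append(message)
    return record


# Usage: a scripted student who misses the first question once.
units = [
    Unit("Unit A", "Question A", 3, {1: "Alternative 1 confuses two terms."}),
    Unit("Unit B", "Question B", 2, {}),
]
scripted = iter([1, 3, 2])
record = run_program(units, lambda i, u: next(scripted))
print(record.choices, record.attempts(0), record.attempts(1))
```

The record of choices and attempts is the part that matters for the argument that follows, since it is the raw material from which a learning rate would be estimated.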
The critical feature of the teaching machine from the point of
view of this paper is the printed record of the amount of time and
the number of tries or errors required by the student to complete a
given program-assignment. A student's record over a number of
assignments, together with the appropriate statistical treatment to
compare it with the records of other students who have completed
the same work, should constitute a useful indication of his learning
rate in a given school subject. In theory at least, such an index
would possess a number of advantages not possessed by the standard
scores on existing intelligence tests, the scores commonly referred
to as I.Q. Some of those advantages are listed below.
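The paper does not say what the "appropriate statistical treatment" would be. One possibility, offered purely as an assumption, is to convert each student's total time and total errors to standard scores within the group that completed the same assignments, and to average them with the sign reversed so that a higher index means faster learning. The function names and the sample figures below are invented for illustration.

```python
from statistics import mean, pstdev


def standard_scores(values):
    """Convert raw values to z-scores within the group."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def learning_rate_index(minutes, errors):
    """Illustrative index: the negative mean of the z-scores for total time
    and total errors. Both arguments map student -> per-assignment values."""
    students = sorted(minutes)
    z_time = standard_scores([sum(minutes[s]) for s in students])
    z_err = standard_scores([sum(errors[s]) for s in students])
    return {s: -(zt + ze) / 2 for s, zt, ze in zip(students, z_time, z_err)}


# Invented records for three students on the same two assignments.
minutes = {"A": [10, 12], "B": [20, 22], "C": [15, 17]}
errors = {"A": [1, 0], "B": [3, 4], "C": [2, 2]}
index = learning_rate_index(minutes, errors)
print(index)
```

Equal weighting of time and errors is an arbitrary choice here; how many variables are needed to define learning rate is one of the open questions the paper itself raises.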


Advantages of a “Teaching Machine” Measure of Learning Rate 
Better Control of the Conditions of Learning 


When he administers an intelligence test, the psychologist samples
the knowledge and skills which his subject has learned in the past,
sometimes years ago. Neither the psychologist nor the subject can
know very much about the conditions under which the learning
occurred, nor what specific experiences were significant, nor how
much time was involved. In any society, some subjects will have
had opportunities and experiences not shared by others, because of
differences in social class, etc. By contrast, if learning rate can be
inferred from the record kept by an automated teaching device, a
great deal will be known about the conditions under which a subject
learned or failed to learn, and there will be a more precise record of
his progress over a period of time than has ever before been available.


A More Precise Definition 


In developing the concept of learning rate, as defined by the use
of the teaching machine, it should be possible to draw on the
experience and research findings of those who developed the concept of
intelligence, in order to construct a sounder, if more limited, concept
than that of I.Q. Specifically, it should be possible to avoid
much of the unplanned-for excess meaning which, in the minds of
most laymen and some psychologists, has accompanied the term
intelligence. Furthermore, some of the unwieldy assumptions which
are implicit in the concept of intelligence would not be made; e.g., it
would not be assumed that all pupils of a given age level have had
an equal opportunity to learn. The operations which would define
the concept of learning rate could be made more precise than has
been the case with the concept of intelligence. There is at present a
more adequate methodology for such purposes than was available to
Binet a half-century ago.


More Productive Research 


It should be possible now to plan research in such a way as to
avoid some of the sterile controversies and unanswerable questions
(unanswerable because of logical inconsistencies within the concept
of intelligence) which have been a part of the history of
intelligence tests. In developing a concept of learning rate, some of
these questions might well be by-passed or else stated in more
meaningful terms. For example, the problem of developing a
"culture free" test would probably disappear. The question as to how
much the I.Q. changes as a result of experience would be replaced
by an even more important question, to the educator at any rate,
as to what kinds of experiences or teaching techniques would bring
about the greatest changes in learning rate.


A Better Predictor of Future Academic Performance 


It seems likely that higher predictive validity could be developed
in the measure of learning rate than in I.Q., at least when the
criterion variable is academic performance. In estimating the
student's learning rate in a particular subject, it would be highly
feasible to take into account a much larger sample of his performance
than is the case with intelligence tests. Or, to put it in other
terms, learning rate would reflect actual performance over a longer
period of time, in a wider variety of academic activities, and in
activities very similar to the kind of future performance which is to be
predicted.


A More Useful Diagnostic Tool 


As the programs for teaching machines are improved and
standardized, it seems reasonable to expect that a student's record of
learning rate would have "diagnostic" values. For example, the
learning rate should prove helpful in identifying students who need
additional tutoring, or remedial work, or who are experiencing a
lapse in interest or motivation. Examination of the records would
supply information about the kinds of errors individual students
may be making, revealing inadequacies in their ways of tackling
problems or faulty sets that may be blocking understanding. Also, a
combination of achievement test scores, together with the learning
rate, might well provide a more effective basis for grouping students
for instructional purposes or at any rate provide additional and
more compelling evidence on the controversial topic of ability
grouping than is afforded by existing procedures.


Other Practical Advantages 


Learning rate would appear to have a number of practical
advantages over I.Q. Special testing sessions would not be required
since learning rate would be estimated from the record of regular
instruction. Skilled examiners would not be required to obtain
learning rate since the teaching machine would provide greater
standardization in the "testing" process than can be achieved by
an individual examiner. Of course the development of the teaching
program requires unusual skill, but more and better programs are
constantly being made available and this trend promises to continue.
Learning rate would perhaps be less subject to distortion as
a result of the tensions and anxieties which some students experience
when they take a special examination or engage in an activity
which is not a part of their daily routine.

It is not intended to suggest that there would be no special or
serious problems to be solved in developing the concept of learning
rate. There would be problems, only some of which could be
anticipated in advance. For example, can an estimate of learning
rate be made outside of a school situation? Would it be more
efficient to develop a special set of programs for the purpose of
estimating learning rate instead of using regular classroom programs, as
proposed above? What about the problem of motivation—might a
program which a student found very interesting result in a higher
rate of learning than one which he found dull? How are programs to
be standardized? Are some subjects inherently more difficult to
learn because they are more abstract? How many variables will be
needed to define learning rate? What will be the correlation between
learning rates in unrelated subjects; e.g., mathematics and
history, etc.? These and others are the kinds of questions which
might be answerable by the development of programmed teaching.
They suggest that the teaching machine provides intriguing data,
almost as a by-product, which may be useful in the exploration of
human behavior, motivation, and potentials well beyond that which
is possible with traditional techniques.


Summary 


There are many good reasons for discarding the concept of the
I.Q. It is proposed that the teaching machine provides a measure
of learning rate which can be precisely defined. The basic data for
estimating the learning rate are to be found in the record kept by
the machine of the time spent, or the number of errors made by a
student in learning a unit of subject matter. A concept of learning
rate developed from these data would have both theoretical and
practical advantages. It would be more precise in meaning and
relatively free from the folklore surrounding the I.Q. Because it
would be based on larger samples of behavior, it would have higher
reliability than the I.Q. It would probably be a better predictor
of future academic performance, as well as a more useful device for
diagnosing a student's academic strengths and weaknesses.


REFERENCES 


Anastasi, Anne. "Some Implications of Cultural Factors for Test
Construction." Proceedings of the 1949 Conference on Testing
Problems, Educational Testing Service, 1950, 13-17.

Cronbach, L. J. and Meehl, P. E. "Construct Validity in Psychological
Tests." Psychological Bulletin, LII (1955), 281-302.

Crowder, Norman A. "Intrinsically Programmed Teaching Devices."
Proceedings of the 1959 Invitational Conference on Testing
Problems, Educational Testing Service, 1960.

Eells, K., Davis, A., Havighurst, R. J., Herrick, V. E. and Tyler, R.
Intelligence and Cultural Differences. Chicago: University of
Chicago Press, 1951.

Guilford, J. P. "The Structure of Intellect." Psychological Bulletin,
LIII (1956), 267-293.

Liverant, Shephard. "Intelligence: A Concept in Need of Re-examination."
Journal of Consulting Psychology, XXIV (1960).

Spiker, C. C. and McCandless, B. R. "The Concept of Intelligence
and the Philosophy of Science." Psychological Review, LXI
(1954), 255-266.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


A COMPARISON OF ACHIEVEMENT SCORES OF PUBLIC-
SCHOOL AND CATHOLIC-SCHOOL PUPILS 


ROBERT H. BAUERNFEIND 
National College of Education 
AND 
WARREN S. BLUMENFELD


Purdue University 


Introduction 


IN educational circles one frequently encounters questions con- 
cerning the relative effectiveness of public-school instructional pro- 
grams and parochial-school instructional programs. The purpose 
of this study was to investigate the possibility that there might be 
group differences in educational achievements for matched groups 
of public-school pupils and Catholic-school pupils at the
eighth-grade level.


The Measuring Instrument 


Two editions of the SRA High School Placement Test (Science
Research Associates, 1959, 1960) were used as the instruments of
investigation in this study. The 1959 edition provided three scores,
and the 1960 edition provided four scores, as follows:


1. Non-Verbal Reasoning (both editions). A measure of general
intelligence that requires no reading, arithmetic, or other
school-learned skills; each item presents five geometric forms,
and the pupil responds to each item by marking the one geometric
form that does not belong with the other four. The
Non-Verbal Reasoning scores were used as a matching variable
in this study. The item type appears to correspond to
Spearman's g factor, and to position E on Cronbach's spectrum
of scholastic aptitude tests (Cronbach, 1960, p. 235).

2. Language Arts (1960 edition only). A measure of skills in
spelling, grammatical usage, capitalization, and punctuation.
The Language Arts scores were treated as a criterion variable
in this study.

3. Arithmetic (both editions). A measure of arithmetic skills
in addition, subtraction, multiplication, and division—involving
whole numbers, fractions, and decimals. The Arithmetic
scores were treated as a criterion variable in this study.

4. Reading (both editions). A measure of reading comprehension
and vocabulary-in-context. The Reading scores were
treated as a criterion variable in this study.


The test battery is administered by subscriber schools each spring 
to provide comparable test data for eighth-grade pupils who will be 
entering high school the following autumn. All of the tests cited 
above are given in one sitting, and all pupil scores are reported as 
Grade Equivalents in this investigation. 


The 1959 Study 


The 1959 edition was administered in the spring of 1959, and 
test records were available for some 80,000 children attending pub- 
lic schools and some 60,000 children attending Catholic schools. A 
sample of 1,000 pupils attending Catholic schools was drawn from 
these records, and a sample of 1,000 pupils attending public schools 
was then drawn to match the Catholic sample. The matching vari- 
ables were (a) performance on the Non-Verbal Reasoning score, 
(b) sex, and (c) geographic region of the United States. (It was 
later determined that 12 pupil records were incomplete with respect 
to the achievement scores so that the final matched samples totaled 
988 each.) 

The Grade Equivalent means, standard deviations, and inter- 
correlations for both 1959 samples are reported in Table 1. Table 1 
also shows the results of tests for significance of difference between 
the mean Grade Equivalent scores of the two 1959 groups. The t 
tests are based on the formula for uncorrelated data. 
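For readers who wish to check the tabled values, the t for uncorrelated data with equal group sizes can be sketched as below. The function name is illustrative, and the use of N (rather than N - 1) in the standard error is an assumption. With the Arithmetic means and standard deviations from Table 1 it gives a value that agrees with the reported 8.32 to two decimals.

```python
import math


def t_uncorrelated(mean1, sd1, mean2, sd2, n):
    """t for the difference between two independent means, equal group sizes."""
    se = math.sqrt((sd1 ** 2 + sd2 ** 2) / n)
    return (mean1 - mean2) / se


# Arithmetic, 1959 study: parochial versus public, N = 988 in each group.
t_arith = t_uncorrelated(10.02, 2.47, 9.04, 2.76, 988)
print(round(t_arith, 2))
```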

With the two groups matched for general reasoning ability, sex,
and geographic region, the Catholic-school pupils scored a mean
Grade Equivalent approximately 1.0 years higher than the
public-school pupils on the two criterion achievement tests, both
differences exceeding the .01 level of confidence.


TABLE 1

Analysis of Mean Differences in Achievement Scores
for the Two Groups in the 1959 Study
(N = 988 in each group)

                          Intercorrelations                       Diff.
                                                                 Between
Test Score   Group        N-V    Ar     Rd      Mean    S.D.     Means      t

Non-Verbal   Parochial     —    .43    .42      8.57    2.39     0.00†     0.00
             Public        —    .55    .50      8.57    2.39
Arithmetic   Parochial    .43    —     .65     10.02    2.47     0.98      8.32**
             Public       .55    —     .72      9.04    2.76
Reading      Parochial    .42   .65     —       9.76    2.10     1.13     10.66**
             Public       .50   .72     —       8.63    2.51

† Matching variable.
** Significant at the .01 level (2.581).


The 1960 Study 


The 1960 edition was administered in the spring of 1960, and test
records were available for some 120,000 children attending public
schools and some 100,000 children attending parochial schools. A
full replication study—except for a modified sampling plan—was
undertaken using cases from this new universe of test data.

Again test records for 1,000 public-school pupils and for 1,000
Catholic-school pupils were drawn from the total universe of 1960
test data. In this study, however, pupils for both groups were
selected from an a priori plan for full representation of geographic
region, sex, and Non-Verbal Reasoning scores. Thus, in the 1959
study the universe of Catholic pupils tested was the basis for
sampling; in the 1960 study a national-sampling model was developed,
and cases for both groups were selected to conform to this model.

The Grade Equivalent means, standard deviations, and intercorrelations
for both 1960 samples are reported in Table 2. Table 2
also shows the results of tests for significance of difference between
the mean Grade Equivalent scores of the two 1960 groups. Again
the t tests are based on the formula for uncorrelated data.

With the two groups matched for general reasoning ability, sex,
and geographic region, and with both groups representing a
national-sampling model, the Catholic-school pupils scored a mean Grade
Equivalent approximately 0.45 years higher than the public-school
pupils on all three criterion achievement tests. All three differences
exceeded the .01 level of confidence.


Discussion of the Findings 


Granted equal mental ability (as defined by Spearman’s g factor), 
and geographic-region and sex controls, parochial-school groups 
scored significantly higher than public-school groups on tests of 
language arts, arithmetic, and reading at the eighth-grade level. The 
differences represented about one G.E. year in the first study and 
about one-half G.E. year in the replication study, and all differ- 
ences were significant at the .01 level of confidence. 

It should be noted that this study involved only three achievement
criteria. We do not know what differences, if any, would be
observed if achievement criteria in social studies, science, literature,
music, art, life adjustment, study skills, etc., were to be included
in similar types of investigations.


TABLE 2

Analysis of Mean Differences in Achievement Scores
for the Two Groups in the 1960 Study
(N = 1000 in each group)

                             Intercorrelations                        Diff.
                                                                     Between
Test Score   Group        N-V    LA     Ar     Rd      Mean    S.D.   Means     t

Non-Verbal   Parochial     —    .48    .53    .54      9.13    2.35   0.00†    0.00
             Public        —    .40    .57    .54      8.19    2.35
Lang. Arts   Parochial    .48    —     .67    .66     10.13    2.48   0.47     4.27**
             Public       .40    —     .69    .67      9.66    2.51
Arithmetic   Parochial    .53   .67     —     .65     10.29    2.47   0.47     4.27**
             Public       .57   .69     —     .70      9.82    2.47
Reading      Parochial    .54   .66    .65     —       9.61    2.29   0.39     3.55**
             Public       .54   .67    .70     —       9.22    2.44

† Matching variable.
** Significant at the .01 level (2.581).


1 After completion of this report, Julian Stanley pointed out that a
regression effect could account for observed group differences in this type of study.
If our Catholic-school examinees were significantly superior in figural-reasoning
vis-a-vis our public-school examinees, our matching procedures would have
resulted in selecting relatively low-scoring Catholic-school youngsters and
relatively high-scoring public-school youngsters on the figural-reasoning test
(each relative to the youngsters' own group). In such an event, score-to-score
regression might have accounted for the obtained data.

A check of our data indicates that this was not the case, and the authors
therefore reiterate the conclusions cited above. However, the possibility of
a regression effect suggests the strong desirability of conducting similar future
studies using an analysis of covariance technique for the total groups tested.
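The regression effect that Julian Stanley raised in the footnote can be illustrated with a small simulation. It is a sketch of the mechanism only and says nothing about the authors' data, which they report having checked. All numerical settings are assumptions: normally distributed ability, a matching test and a criterion test each with reliability of about .50, and a half standard deviation difference in mean ability between the groups. There is no school effect anywhere in the model.

```python
import random
from collections import defaultdict
from statistics import mean


def simulate_group(n, true_mean, rng):
    """Return (matching score, criterion score) pairs for one group.

    Both scores are the same true ability plus independent error, so the
    groups differ only in their mean true ability.
    """
    pupils = []
    for _ in range(n):
        ability = rng.gauss(true_mean, 1.0)
        matching = ability + rng.gauss(0.0, 1.0)
        criterion = ability + rng.gauss(0.0, 1.0)
        pupils.append((matching, criterion))
    return pupils


def match_on_score(group_a, group_b, width=0.25):
    """Pair pupils whose matching scores fall in the same interval of the
    given width; pupils without a partner are dropped."""
    bins_a, bins_b = defaultdict(list), defaultdict(list)
    for m, c in group_a:
        bins_a[round(m / width)].append(c)
    for m, c in group_b:
        bins_b[round(m / width)].append(c)
    matched_a, matched_b = [], []
    for key, crit_a in bins_a.items():
        crit_b = bins_b.get(key, [])
        k = min(len(crit_a), len(crit_b))
        matched_a.extend(crit_a[:k])
        matched_b.extend(crit_b[:k])
    return matched_a, matched_b


rng = random.Random(1963)
higher = simulate_group(5000, 0.5, rng)  # group with the higher mean ability
lower = simulate_group(5000, 0.0, rng)
crit_higher, crit_lower = match_on_score(higher, lower)
difference = mean(crit_higher) - mean(crit_lower)
print(len(crit_higher), round(difference, 2))
```

Under these assumptions the matched groups still differ on the criterion, because each pupil's criterion score regresses toward the mean of his own group. An analysis of covariance on the total groups, as the footnote recommends, is one way of guarding against this.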

Moreover, with respect to the three criteria studied, we do not 
know why the significant differences occurred. Test motivation 
should not be a contributor to the observed data, in that the match- 
ing Non-Verbal Reasoning score and the criterion achievement 
scores were derived from one inclusive test session. It is possible that 
the observed achievement differences are cumulative effects of the 
eight years of prior instruction, or that they were effected at specific 
grade levels. It is also possible that the observed differences were 
effected by home or cultural influences unique to the two groups, 
rather than by the school programs per se. These types of possi- 
bilities could well be explored through other investigations. 

It is important to note that these broad findings would not neces- 
sarily apply to any given local group of parochial-school and public- 
school children. But, on a national basis, circa 1960, Catholic-school 
eighth-grade groups showed significantly higher levels of achieve- 
ment in three curriculum areas than did public-school eighth-grade 
groups. 


Procedural Aspects of the Study 


There are three procedural aspects of this study that should be 
noted explicitly. First, the growth of test publishers’ central-scoring 
programs provides a wealth of test data for broad national samples 
that may be useful in investigating a variety of educational prob- 
lems. It seems to the present writers that the formulation of prob- 
lems is not a responsibility of the publishers, but it is noted that 
these sources of test data can be tapped by independent investiga- 
tors with the publishers’ cooperation. 

Second, use of an achievement-free intelligence test is vital to an
investigation of this type. Use of conventional reading-and-arith-
metic tests of “mental ability” would have precluded this investiga-
tion, since we would then have found ourselves in the impossible
position of matching groups on achievement for purposes of inves-
tigating group differences in achievement.

Third, the cornerstone of scientific inquiry is the principle of
replication: that, under similar controlled conditions, one should
obtain similar findings. It seems reasonable that the original
investigators should assume responsibility for at least one independent
replication study prior to publication of their findings.


REFERENCES 


Cronbach, L. J. Essentials of Psychological Testing (Second Edi- 
tion). New York: Harper & Brothers, 1960. 

Science Research Associates. SRA High School Placement Test— 
1959 Edition. Chicago: Science Research Associates, 1959. 
Science Research Associates. SRA High School Placement Test— 

1960 Edition. Chicago: Science Research Associates, 1960. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


BEST CLASSIFYING EVERY INDIVIDUAL 
AT EVERY LEVEL 


LOUIS L. McQUITTY 
Michigan State University 


HIERARCHICAL syndrome analysis (McQuitty, 1960b) classifies
persons (or institutions) into categories on the basis of their pre-
dominant patterns of responses (or characteristics). The classifica-
tion is at successive levels, analogous to those of species, genera,
families, etc., in the biological classification of plants and animals;
each individual is first classified into the species into which he “best
fits,” and then each species into the genus into which it “best fits,”
and then each genus into the family into which it “best fits,” etc.

Multiple hierarchical analysis (McQuitty, 1962) repeats hierar-
chical syndrome analysis one or more times. After every person
has been classified in terms of an hierarchical syndrome analysis,
a numerical criterion is applied to determine the level at which each
individual best classifies. The responses used in this classification
are called the individual’s predominant pattern. Every individual
is then reclassified in terms of his next most predominant pattern,
etc., until every individual has been classified in terms of nearly
all of his responses.

A possible weakness in the above approaches is that they do not
insist that every individual be best classified at every level; they
first best classify the individual into a species and then best classify
the species into a genus.

The method here outlined best classifies each individual into a
group of two persons, then a group of three, then a group of four,
etc. It, too, is adaptable into a multiple classification system by
application of the appropriate techniques of multiple hierarchical
analysis.




The method is illustrated by applications to the matrix of Table 1
(Rows and Columns A through H only) which reports agreement
scores between companies in terms of union-management relations,
Companies A and B being in the construction industry, C and D
in trucking, E in grain processing, F in metal products, and G and
H in garment manufacturing. This matrix was chosen because it
was used in describing the methods of Hierarchical Analysis
(McQuitty, 1960 and 1962).


TABLE 1
Best Classifying Individual Companies into Categories of Two and Three Companies

              A    B    C    D    E    F    G    H

      A       -   29   16   16   14    6   11    7
      B      29    -   17   17   13    6    8   10
      C      16   17    -   26   10    8    9   13
      D      16   17   26    -   10   12   11   11
      E      14   13   10   10    -   21   17   13
      F       6    6    8   12   21    -   19   17
      G      11    8    9   11   17   19    -   24
      H       7   10   13   11   13   17   24    -

 29   AB      -    -   16   16   13    6    8    7
 16   AC      -   16    -   16   10    6    9    7
 16   AD      -   16   16    -   10    6   11    7
 14   AE      -   13   10   10    -    6   11    7
  6   AF      -    6    6    6    6    -    6    6
 11   AG      -    8    9   11   11    6    -    7
  7   AH      -    7    7    7    7    6    7    -
 17   BC     16    -    -   17   10    6    8   10
 17   BD     16    -   17    -   10    6    8   10
 13   BE     13    -   10   10    -    6    8   10
  6   BF      6    -    6    6    6    -    6    6
  8   BG      8    -    8    8    8    6    -    8
 10   BH      7    -   10   10   10    6    8    -
 26   CD     16   17    -    -   10    8    9   11
 10   CE     10   10    -   10    -    8    9   10
  8   CF      6    6    -    8    8    -    8    8
  9   CG      9    8    -    9    9    8    -    9
 13   CH      7   10    -   11   10    8    9    -
 10   DE     10   10   10    -    -   10   10   10
 12   DF      6    6    8    -   10    -   11   11
 11   DG     11    8    9    -   10   11    -   11
 11   DH      7   10   11    -   10   11   11    -
 21   EF      6    6    8   10    -    -   17   13
 17   EG     11    8    9   10    -   17    -   13
 13   EH      7   10   10   10    -   13   13    -
 19   FG      6    6    8   11   17    -    -   17
 17   FH      6    6    8   11   13    -   17    -
 24   GH      7    8    9   11   13   17    -    -




Several alternative procedures are possible in every step of the 
analysis. The first step will be described to illustrate a compre- 
hensive approach. A comprehensive approach is usually desirable 
for the initial step because the indices used in this step are often 
less dependable than those used in subsequent steps. However, if 
the matrix is unusually large, a more condensed method, such as 
described later, may have to be applied initially. 

In the first step of the comprehensive version, the analysis is 
restricted to Column A of Table 1. A computer is used to deter- 
mine the agreement score not only of A with every other variable, 
B through H, but also of A with every pair of variables BC through 
GH (excluding those pairs which themselves contain an A, i.e., AB 
through AH). 

An agreement score between two or more companies is the number 
of items on which they agree in their scores; Companies A and B 
agree on an item if they are both either positive or negative. AB 
and C agree if they are all either positive or negative, but not if 
they have any other combination. 
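The counting rule just stated can be sketched in a few lines of Python. This is only an illustration: the function name `agreement_score` and the three response profiles are hypothetical, and each profile is assumed to be a list of dichotomous item scores (1 for positive, 0 for negative).

```python
def agreement_score(*profiles):
    """Count the items on which every profile gives the same response."""
    return sum(1 for responses in zip(*profiles) if len(set(responses)) == 1)


# Hypothetical six-item profiles (1 = positive, 0 = negative).
company_a = [1, 1, 0, 1, 0, 0]
company_b = [1, 0, 0, 1, 1, 0]
company_c = [1, 1, 0, 0, 1, 0]

print(agreement_score(company_a, company_b))             # 4
print(agreement_score(company_a, company_b, company_c))  # 3
```

The same function serves for a pair, a triad, or any larger category, since an item counts only when all members give the identical response.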

An alternative procedure, up to this stage, would have been 
this: go through each column of the original matrix (defined by 
Rows and Columns A through H) one column at a time, and select 
the four highest entries in each column. Whether we decide on the 
four highest, or some other number of them, depends on how com- 
prehensive we wish to be. If we decide on four, we are in effect 
assuming that this number will bring in all of the pairs with which 
the individual companies have most in common at the triad level. 
For example, in the next stage of the analysis we will wish to clas-
sify Company A with the two other companies with which it has
most in common. We are assuming that these two companies are
included when we take the four highest entries between pairs from
every column. However, the assumption is more comprehensive than
thus far indicated. For example, at the tetrad level, we will wish to
classify Company A with the three other companies with which it
has most in common. Consequently, the original selection of pairs
is assumed to bring in all of those which are needed for correct
classification at all subsequent levels. This is called the selection
assumption, and the number of pairs selected from each column
is called the selection code.

Another alternative for reducing the laboriousness of the analysis
is to use the classification assumption (McQuitty, 1960a). It elimi-
nates the necessity of computing agreement scores other than be-
tween pairs as shown in the original matrix of Table 1 (Rows and
Columns A through H only).

The classification assumption states that a category has as much
in common as its two members with the least in common. Consider,
for example, the Category ABE in which its pairs AB, AE and BE
have agreement scores of 29, 14, and 13 respectively, with BE hav-
ing the smallest score, 13. Therefore, the Category ABE is assumed
to have an agreement score of 13.

The classification assumption was used in completing the entries 
for the cells of Table 1 which are defined by the intersections of 
Rows AB through GH with Columns A through H. The entry for 
the cell of Row BE-Column A was illustrated in the above example 
and the score 13 is entered in that cell showing that Category ABE 
is estimated to have an agreement score of 13. 
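The classification assumption amounts to taking the minimum over all pairs within a category. The sketch below assumes the pairwise agreement scores of Table 1 as they are read here (Rows and Columns A through H); the names `UPPER`, `PAIR`, and `category_score` are illustrative and do not come from the original paper.

```python
from itertools import combinations

COMPANIES = "ABCDEFGH"

# Upper triangle of the pairwise matrix of Table 1, row by row:
# row A holds AB..AH, row B holds BC..BH, and so on.
UPPER = {
    "A": [29, 16, 16, 14, 6, 11, 7],
    "B": [17, 17, 13, 6, 8, 10],
    "C": [26, 10, 8, 9, 13],
    "D": [10, 12, 11, 11],
    "E": [21, 17, 13],
    "F": [19, 17],
    "G": [24],
}

PAIR = {}
for i, first in enumerate(COMPANIES[:-1]):
    for second, score in zip(COMPANIES[i + 1:], UPPER[first]):
        PAIR[frozenset((first, second))] = score


def category_score(members):
    """Estimated agreement score: the smallest pairwise score in the category."""
    return min(PAIR[frozenset(pair)] for pair in combinations(members, 2))


print(category_score("ABE"))  # 13, the entry for Row BE, Column A
print(category_score("ABC"))  # 16
```

Under this assumption no agreement counts beyond the original pairs are needed; every entry in the lower portion of Table 1 is a minimum of three pairwise scores.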

When the classification assumption is used, it is helpful to have
the agreement scores for the members of the rows written to their
left. Thus 13, the agreement score for BE, is written at the begin-
ning of Row BE.

After Column A has been completed and we have agreed on a
selection code (such as 4 in this case), we then select the four
highest entries in the lower portion of Column A. These are under-
lined in Table 1.

The procedure is repeated for Columns B through H, one at a
time, with the four highest entries being underlined to show both
the categories represented by each highest entry and their agree-
ment score, i.e., the entry itself.

In the case of ties, such as in Column D where the three highest
entries are 17, 16 and 16 followed by five agreement scores of 11,
all five 11's are taken as well as the scores of one 17 and two 16's.
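The tie rule can be illustrated with a short Python sketch. The helper `select_with_ties` is a hypothetical name, and the dictionary holds only a subset of the Column D entries of Table 1 (row label mapped to the estimated triad score), which is enough to show the behavior described above.

```python
def select_with_ties(entries, code):
    """Keep every entry scoring at least as high as the code-th highest entry."""
    ranked = sorted(entries.values(), reverse=True)
    cutoff = ranked[min(code, len(ranked)) - 1]
    return {label: score for label, score in entries.items() if score >= cutoff}


# Subset of Column D of Table 1 (pairs not containing D).
column_d = {
    "BC": 17, "AB": 16, "AC": 16,
    "AG": 11, "CH": 11, "FG": 11, "FH": 11, "GH": 11,
    "AE": 10, "BG": 8, "AF": 6,
}

selected = select_with_ties(column_d, code=4)
print(sorted(selected.values(), reverse=True))  # [17, 16, 16, 11, 11, 11, 11, 11]
```

With a selection code of 4 the fourth-highest score is 11, so all five entries of 11 are retained along with the 17 and the two 16's.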

The underlined agreement scores of Table 1 and their correspond- 
ing triads were transferred to Table 2 and are shown in the left- 
hand column of the latter table; no triad was repeated in Table 2 
even though it occurred more than once in Table 1. 

The next step is to prepare and analyze Column A of Table 2.
It is prepared from both (a) Column A (Rows A through H) of
Table 1 and (b) the triads together with their agreement scores of
Table 2.




TABLE 2
Best Classifying Individual Companies into Categories of Four Companies

      Triads    A    B    C    D    E    F    G    H

 16   ABC       -    -    -   16   10    6    8    7
 16   ABD       -    -   16    -   10    6    8    7
 13   ABE       -    -   10   10    -    6    8    7
 16   ACD       -   16    -    -   10    6    9    7
 17   BCD      16    -    -    -   10    6    8   10
 11   CDH       7   10    -    -   10    8    9    -
 11   ADG       -    8    9    -   10    6    -    7
 11   DFG       6    6    8    -   10    -    -   11
 11   DFH       6    6    8    -   10    -   11    -
 11   DGH       7    8    9    -   10   11    -    -
 17   EFG       6    6    8   10    -    -    -   13
 13   EFH       6    6    8   10    -    -   13    -
 13   EGH       7    8    9   10    -   13    -    -
 17   FGH       6    6    8   11   13    -    -    -
 11   AEG       -    8    9   10    -    6    -    7


No entry is shown between Company A and Triad ABC. This is
an example of a practice followed throughout the analysis; no score
is needed between a company and a category containing that com-
pany.

The first score required in Column A is between Company A
and Category BCD, or in other words the agreement score for
Category ABCD. Since Category BCD has an agreement score of
17, we know that pairs BC, BD, and CD each has an agreement
score of at least 17, and furthermore that one of them has an agree-
ment score of no more than 17 (since the classification assumption
was applied in determining the agreement score for the category).
Then the agreement score for Category ABCD (by the classifica-
tion assumption) is the smallest of the following: the agreement
score for Category BCD, 17, and the scores for pairs AB, 29; AC, 16;
and AD, 16. Sixteen is the smallest number and 16 is, therefore, the
agreement score for Category ABCD. The other cell entries of
Column A, Table 2, were computed in the same fashion.
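The step from a triad to a tetrad needs only the triad's own score and the newcomer's pairwise scores. A minimal Python sketch follows; the function name `extend_category` is hypothetical, and the pairwise scores are the Column A values of Table 1 as read here.

```python
def extend_category(score, members, newcomer, pair_scores):
    """Score of the enlarged category: the smaller of the old category score
    and the newcomer's pairwise scores with each existing member."""
    links = [pair_scores[frozenset((newcomer, m))] for m in members]
    return min([score] + links)


# Pairwise scores of Company A with the companies used below (Table 1).
pair_scores = {
    frozenset("AB"): 29,
    frozenset("AC"): 16,
    frozenset("AD"): 16,
    frozenset("AH"): 7,
}

print(extend_category(17, "BCD", "A", pair_scores))  # 16, Category ABCD
print(extend_category(11, "CDH", "A", pair_scores))  # 7, Category ACDH
```

Because the old category score already summarizes every pair inside the category, the pairs among B, C, and D need not be consulted again.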

A code selection of 2 was applied to Column A, thus selecting
Categories ABCD, ACDH, ADGH, and AEGH, with agreement
scores of 16, 7, 7, and 7 respectively. Four categories were obtained
rather than just 2 because of the ties for categories with agreement
scores of 7.

A “code selection” of 2 was chosen for Table 2 in lieu of 4, as in
Table 1, because the data is presumed to be more dependable at this
higher level of classification; fewer cases are necessary in order to
satisfy the selection assumption. However, another consideration
here was to reduce the tables necessary to illustrate the method.
Codes as small as 2 are not being recommended at this time when
computers are available. Different sets of data will require different
selection codes, depending on both their reliability and type of
structuring. Experience with data analysis will assist in setting
appropriate selection codes.


TABLE 3
Best Classifying Individual Companies into Categories of Five Companies

[Table body illegible in the source.]

Columns B through H of Table 2 were analyzed in the same
fashion as Column A, and then the values of Table 3 were obtained
from those of Table 2 in the same manner as those from Table 2
were derived from Table 1. The analysis continued in this same
fashion, as shown in Tables 4 through 6, until all companies were
classified together by Column A of Table 6, and concurred in by
Columns B and F of the same table.


TABLE 4
Best Classifying Individual Companies into Categories of Six Companies

                A    B    C    D    E    F    G    H

  7   ABCDH     -    -    -    -    7    6    7    -
 10   ABCDE     -    -    -    -    -    6    8    7
  7   ACDEH     -    7    -    -    -    6    7    -
 10   BCDEH     7    -    -    -    -    6    8    -
  9   ACDEG     -    8    -    -    -    6    -    7
 10   DEFGH     6    6    8    -    -    -    -    -
  8   CDEFH     6    6    -    -    -    -    8    -
  9   CDEGH     7    8    -    -    -    8    -    -


TABLE 5
Best Classifying Individual Companies into Categories of Seven Companies

                 A    B    C    D    E    F    G    H

  7   ABCDEH     -    -    -    -    -    6    7    -
  7   ACDEGH     -    7    -    -    -    6    -    -
  8   ABCDEG     -    -    -    -    -    6    -    7
  8   BCDEGH     7    -    -    -    -    6    -    -
  8   CDEFGH     6    6    -    -    -    -    -    -


Elaboration of the Method 


This method thus far shows how each company (or person) can
be best classified first in terms of pairs, i.e., with the one other com-
pany (or person) with which it most agrees, then successively with
a pair, triad, tetrad and so forth of other companies with which
it most agrees.

Multiple hierarchical analysis (McQuitty, 1962) offers a solution
for determining which of these levels is most appropriate for best
classifying each company (or person). But this, the first most ap-
propriate classification, is not usually based on all characteristics
of the company (or person) as encompassed by the assessments
used. This, the first classification, is in terms of the most predomi-
nant pattern; there are sometimes the second, third, fourth, etc.,
most predominant patterns. Multiple hierarchical analysis outlines
techniques for isolating these higher order patterns. They can be
adapted to the methods here outlined.


Some Precautions 


Analysis by use of the classification assumption is often more
rapid than by computation of the agreement scores, and selected
scores can always be computed in checking the assumption; the lar-
gest estimated score by the classification assumption could always
be compared with the next largest estimated score, based on an
actual count of agreements. One could revert to counting whenever
discrepancies are found.


TABLE 6
Best Classifying Individual Companies into Categories of Eight Companies

                  A    B    C    D    E    F    G    H

  7   ABCDEGH     -    -    -    -    -    6    -    -
  6   ACDEFGH     -    6    -    -    -    -    -    -
  6   BCDEFGH     6    -    -    -    -    -    -    -

The classification assumption has an unusual advantage; it can 
be applied to a great range of indices, including indices of associa- 
tion for continuous data if it is meaningful from a substantive point 
of view to do so. One might be interested, for example, in the inter-
correlations between people on continuous variables. An hierarchical
analysis would yield patterns of standing on combinations of scales 
for various categories of people. This solution would be expeditiously 
possible by means of the classification assumption. 

There is a problem in the number to be chosen for the selection
code. A reasonable solution is to select liberally and to keep a
record of the order of selection in each column; the code can gen-
erally be regarded as amply large if the last combination selected
in a column is never required in forming the categories of the next
higher order, such as in going from Table 2 to Table 3, for example.
If any one (or more) of them is used, then a higher and higher
selection code should be used until this is no longer the case. Once
the last selected categories have not been used, it can be assumed
that additional ones would not have been used if they had been
available; the solutions obtained are by a reasonable assumption
the best ones possible.
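The bookkeeping suggested here can be sketched as follows. This is a hedged illustration only: the function `code_is_ample`, the record `selection_order`, and the set `used_next_level` are hypothetical names and hypothetical data, and the rule for which selection counts as "last" when ties occur is assumed to be simply the final entry of each column's record.

```python
def code_is_ample(selection_order, used_next_level):
    """True if, in every column, the last-selected category went unused
    when the categories of the next higher order were formed."""
    return all(
        order[-1] not in used_next_level
        for order in selection_order.values()
        if order
    )


# Hypothetical record: categories in their order of selection per column.
selection_order = {
    "A": ["ABCD", "ACDH", "ADGH", "AEGH"],
    "B": ["ABCD", "BCDH"],
}

# Hypothetical set of categories actually drawn on at the next level.
used_next_level = {"ABCD", "ACDH"}

print(code_is_ample(selection_order, used_next_level))               # True
print(code_is_ample(selection_order, used_next_level | {"BCDH"}))    # False
```

When the check fails, the passage above calls for repeating the step with a larger selection code until it passes.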

The lowest score of a matrix is the agreement score for the cate- 
gory which includes all companies (or persons). This statement de- 
rives from the classification assumption. When the lowest score 
appears once or several times in the analysis of a column, such as in 
Column F of Table 3, it is useless to include the category of any 
one of these in the next phase of the analysis; it will appear again in 
other columns of subsequent phases and the analysis is completed 
when every column of a phase contains no entry higher than the 
lowest one of the original matrix; every company (or person) has 
then been classified into a single category (as shown in Table 6). 

It is helpful, also, to realize that a number may repeat itself sev-
eral times in a column because the limit of a type is being realized
in classifying companies (or persons) into a category. If the type
is relatively large, this condition can produce many ties in an analy-
sis. It is, therefore, reasonable to set an upper limit to which the
code selection number may be expanded in terms of ties, say no
more than the code number plus four.

The method does not determine the associations between individ- 
uals (or institutions) in any fundamental sense; these are merely 
computed as a part of the method. The determination depends on 
both characteristics of the individuals and the tests chosen for the 
study. The selection of tests is, therefore, a matter of great impor- 
tance, but elaboration is beyond the scope of this paper. 


Summary 


This paper reports a method for classifying every individual (or
institution) first with the one individual he is most like, then with
the two individuals he is most like, then with the three individuals
he is most like, etc. The classifications are based on either assess-
ments by individual items or by total scores on n tests.

The method can be converted into a multiple system of classifica-
tions where individuals are classified first in terms of their most
predominant patterns, and then successively in terms of their sec-
ond, third, etc., most predominant patterns.


REFERENCES 


McQuitty, L. L. “Hierarchical Linkage Analysis for the Isolation
of Types.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT,
XX (1960), 55-67. (a)

McQuitty, L. L. “Hierarchical Syndrome Analysis.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, XX (1960), 293-304. (b)

McQuitty, L. L. “Multiple Hierarchical Classifications of Institu-
tions and Persons with Reference to Union-Management Rela-
tions and Psychological Well-Being.” EDUCATIONAL AND PSYCHO-
LOGICAL MEASUREMENT, XXII (1962), 513-531.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


VALIDITY STUDIES SECTION 


Edited by 
WILLIAM B. MICHAEL 
University of California, Santa Barbara 


Predicting Success in High School Foreign Language Courses.
PAUL PIMSLEUR


Intertest Correlations of the Wechsler Intelligence Scale for
Children and Two Picture Vocabulary Tests. GEORGE MOED,
BYRON W. WIGHT, AND PATRICIA JAMES


Gains in Various Measures of Communication Skills Relative
to Three Curricular Patterns in College. WILLIAM B. MICHAEL,
ROBERT CATHCART, WAYNE S. ZIMMERMAN, AND MILO MILFS


Stability of Predictive Validities of High School Grades and of
Scores on the Scholastic Aptitude Test of the College En-
trance Examination Board for Liberal Arts Students. WIL-
LIAM B. MICHAEL AND ROBERT A. JONES


GRE Aptitude Scores as Predictors of GPA for Graduate Stu-
dents in Education. WALTER B. BORG


Personality Correlates of Success in Student-Teaching. GLENN
W. DURFLINGER


ANNOUNCEMENT REGARDING VALIDITY STUDIES 


The VALIDITY STUDIES SECTION is published twice a year, 
once in the Summer issue and again in the Winter issue, for which 
the closing dates for receiving manuscripts are February first and 
August first, respectively. Although articles between two and eight 
printed pages are usually preferred, an occasional exception is made 
to publish articles of somewhat greater length. 

Considerable flexibility exists concerning format as can be seen 
from a study of recently published articles. However, the model 
presented in the Spring, 1953, issue of EDUCATIONAL AND 
PSYCHOLOGICAL MEASUREMENT still represents a close ap- 
proximation to what is customarily published. Reprints of this model 
study are still available. The prospective contributor is encouraged
to read the original announcement.

In order that the usual number of articles of other types may
not be reduced, it is necessary to enlarge the journal and to charge
the authors for most of the publishing costs. For a running page
of printed text the cost is fifteen dollars per page with extra charges
for tables and complex material. Each author receives 100 free re-
prints.

Manuscripts should be sent to 

William B. Michael 

Professor of Education and Psychology 
University of California, Santa Barbara 
University, California 




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


PREDICTING SUCCESS IN HIGH SCHOOL FOREIGN 
LANGUAGE COURSES 


PAUL PIMSLEUR 
The Ohio State University 


Previous studies in this series (Pimsleur, Stockwell & Comrey,
1962) have attempted to predict achievement in college French
courses, using as predictors specially constructed pure-factor tests.¹
The results of these efforts were: a) achievement as measured by
the Cooperative French Test was predicted to the extent of .65
(after correction for shrinkage) by a battery of six tests; b) achieve-
ment in speaking and in listening comprehension were each predicted
to the extent of .41 by a battery of five tests; c) the main contrib-
uting factors were found to be Verbal Intelligence and Interest
(motivation), although Reasoning, Word Fluency, and Pitch Dis-
crimination also helped prediction.

The present study attempts similar prediction at the secondary
school level, in Spanish as well as in French. The predictive tests
found most effective on the college population, plus criterion meas-
ures, were administered in the Spring of 1961 to fifty beginning
French students and to 174 beginning Spanish students at Culver
City Junior High School and Culver City High School. In this school
system a student may elect to begin language study in junior high
school if his English teacher rates him as having aptitude for such
study; otherwise he begins in high school. Teachers have reported
a difference between students who select French and those who
select Spanish, the former being, according to them, more highly
motivated and perhaps more homogeneous in ability though not
necessarily of higher mean ability.

¹ The research reported herein was performed pursuant to a contract
with the United States Office of Education, Department of Health,
Education, and Welfare.

Predictor and Criterion Variables. The predictor and criterion 
measures may be described as follows. 


1. Vocabulary (Guilford-Zimmerman). A test of English vocab- 
ulary knowledge. 40 items, 6 minutes. 

2. Interest I. The subject rates on a 5-point scale his interest 
in studying the language he is now studying. 

3. Interest II. Twenty items relating to interest in language. An- 
swers on a 5-point scale. Example: 


I enjoy going to see foreign films in the original language.

a. a great deal

b. quite a bit

c. some

d. not much

e. not at all

4. Linguistic Analysis (elementary form). The subject is given
a list of foreign forms (adapted from Kabardian) and their
English equivalents. He is to deduce from these how other
things are said in this language. 15 items, 10 minutes.

5. Reading Aloud. The subject is given time to read over a
paragraph of random English words and is then asked to record
his reading of it “speaking as quickly as possible while still
remaining intelligible.” Score is number of words read in 30
seconds.

6. Rhymes. The subject is to give as many words as possible
that rhyme with four given words (LAKE, CLOUD, SO,
GRASS). Score is number of rhymes given in 2 minutes.

7. Chinese Pitch. The subject is taught 3 Chinese words which
differ only in pitch. These are then embedded in sentences and
the subject must tell which he hears. 30 items, 10 minutes.

8. Sex. Scored 0 for girls, 1 for boys.

9. Criterion: Cooperative (French, Spanish) Test, Elementary
Forms Q and R. A standardized test of achievement in reading,
grammar, and vocabulary. Total scores used.

10. Criterion: Pictorial Auditory Comprehension Test in
(French, Spanish). The subject must select from among four
pictures the one which correctly illustrates a sentence he hears.
Tape recorded, 75 items, 30 minutes.




The scores on these ten measures were analyzed separately for
French and Spanish, and for the two criteria within each language.
The data were submitted to a stepwise regression analysis which
selects the optimum one-test battery, two-test battery, etc., and
provides the regression equation, the multiple correlation coefficient,
and various other information at each step. The analyses were
performed on an IBM 709 computer located at the Western Data
Processing Center, UCLA; the program was BIMD-34, obtained
from the Biostatistics Laboratory, Department of Preventive Medi-
cine, UCLA, through the courtesy of Dr. Peter M. Neely.
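The general idea of a stepwise analysis can be sketched in Python as a forward selection: at each step the predictor giving the largest multiple R is added. This is a minimal sketch under stated assumptions and is not a description of the BIMD-34 program; the stopping threshold `min_gain`, the helper names, and the synthetic data are all hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def multiple_r(rows, y, cols):
    """Multiple correlation of y with the predictors in cols (least squares)."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    mean_y = sum(y) / len(y)
    ss_total = sum((t - mean_y) ** 2 for t in y)
    ss_resid = sum((t - f) ** 2 for t, f in zip(y, fitted))
    return max(0.0, 1.0 - ss_resid / ss_total) ** 0.5


def forward_stepwise(rows, y, min_gain=0.001):
    """Add predictors one at a time; stop when the gain in R is negligible."""
    remaining = list(range(len(rows[0])))
    chosen, history, best_r = [], [], 0.0
    while remaining:
        r, j = max((multiple_r(rows, y, chosen + [c]), c) for c in remaining)
        if r - best_r < min_gain:
            break
        chosen.append(j)
        remaining.remove(j)
        best_r = r
        history.append((j, r))
    return history


# Synthetic data: the criterion depends strongly on predictor 2, weakly on 0.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [3 * row[2] + row[0] + rng.gauss(0, 0.5) for row in rows]

history = forward_stepwise(rows, y)
for step, (variable, r) in enumerate(history, start=1):
    print(step, variable, round(r, 3))
```

Each line of output corresponds to one row of the step tables reported below: the step number, the variable added, and the cumulative multiple R.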

Predicting Cooperative French Test Scores. The validities of bat-
teries of various sizes for predicting Cooperative French Test scores
are cited below. For brevity, variables are shown in the order in
which they were added by the program; more precisely, each multi-
ple correlation coefficient, R, is the validity coefficient of a battery
consisting of all the variables up to that point, taken in order of
variable number. (All eight predictor variables may not appear
because the program stops when the addition of a new variable
would make a negligible contribution.)


Step    Variables added            Multiple R
1       4. Linguistic Analysis     .613
2       6. Rhymes                  .745
3       5. Reading Aloud           .791
4       2. Interest I              .832
5       1. Vocabulary              .854
6       3. Interest II             .864
7       7. Chinese Pitch           .865


The choice of a best battery from among these seven is largely a
matter of administrative practicality. The seven measures can be
administered in a 50-minute class period, but the four- and five-test
batteries provide almost as much predictive power as all seven.
The regression equation associated with the seven-test battery
follows:


Y' = 14.18 + .43X_1 + 2.72X_2 + .18X_3 + .96X_4
     - .17X_5 + .53X_6 + .06X_7.


Most noteworthy in these results is the high degree of prediction
achieved, although a substantial drop in the multiple R may be
expected in cross-validation. The multiple R of .865 is well above
the figures reported in the literature. Typical coefficients for
high school foreign language prediction fall around .50. An estimate
of shrinkage (McNemar, 1955, formula 75) indicates that a coeffi-
cient of .845 might be expected on a new sample. This coefficient
is so high that one is led to think accidental factors may have been
operating to some extent in favor of the prediction. Nevertheless,
the battery does show much promise. A validation study is under
way to attempt similar prediction on a new and larger sample.


TABLE 1
Intercorrelations of Test and Criterion Variables, Means and Standard Deviations
French Sample (N = 50)

[Table body illegible in the source.]
There are several points of interest in the order of the variables.
It might have been expected, on the basis of previous results in this
and other investigations, that the test of vocabulary would provide
the best single predictor. However, the Linguistic Analysis Test
seems to account for much of the variance attributable to vocabu-
lary knowledge, plus other variance specific to learning French.
It would appear to be a very useful test. Even used alone it
would probably be an excellent predictor, correlating as it does to
the extent of .613 with the criterion. Previous factor analytic find-
ings (Pimsleur, Stockwell & Comrey, 1962) indicate that this test
is associated with a Reasoning factor. The fact that it seems to
account for much of the variance attributable to the Verbal factor
(Vocabulary) means that it probably contains both Verbal and
Reasoning elements. The best name for the factor it measures is
probably Verbal Intelligence.

The second largest contribution was made by the Rhymes Test,
which represents a factor of Word Fluency. This factor, which has
made a small but consistent contribution throughout these studies,
seems to account reliably for perhaps one or two per cent of the
variance in foreign language learning.

The next contribution is that of the Reading Aloud Test, repre-
senting a factor of Speed of Articulation. The role played by this
factor is most uncertain. In a study involving 208 second semester
French students at UCLA (Pimsleur, Stockwell & Comrey, 1962), two
tests of this factor both correlated well with teacher grades; in a
follow-up with 202 students in the same course a year later, the
tests failed to correlate significantly with the Cooperative
French Test. In the present study, the correlation of Reading Aloud
with the Cooperative French Test is -.32, a coefficient significant at
the .05 level. Must one conclude that “fast talkers” do less well in
high school French? No clear explanation presents itself.

It is of interest that motivation as measured plays a relatively
small role in this sample. Its contribution is usually more nearly
equal to that of verbal intelligence.

Predicting French Auditory Comprehension. The following list 
(same format as above) shows the increase in the multiple cor- 
relation coefficient as variables are added in optimal sequence. 


Step    Variables added            Multiple R
1       4. Linguistic Analysis     .479
2       7. Chinese Pitch           .511
3       5. Reading Aloud           .543
4       1. Vocabulary              .561
5       8. Sex                     .604
6       2. Interest I              .610
7       3. Interest II             .611


The regression equation associated with the seven-test battery is:


Y' = 19.85 + .39X_1 - .80X_2 + .03X_3 + .64X_4
     - [illegible] X_5 + .20X_7 - 3.55X_8.


The Linguistic Analysis Test again provides the best single pre- 
dictor, correlating more highly with the criterion than the Vocabu- 
lary Test (.479 vs. .328). The Chinese Pitch Test, tapping the factor 
of Pitch Discrimination, makes the second highest contribution. 
The Reading Aloud Test again correlates negatively with the cri- 
terion. The role of the Interest measures is again a small one. 

Predicting Cooperative Spanish Test Scores. Turning from French 
to Spanish, the cumulative multiple R is as follows: 


Step    Variables added            Multiple R
1       1. Vocabulary              .383
2       2. Interest I              .496
3       4. Linguistic Analysis     .539
4       3. Interest II             .547
5       5. Reading Aloud           .549
6       8. Sex                     .551
7       7. Chinese Pitch           .552

TABLE 3
Intercorrelations of Test and Criterion Variables; Means and Standard Deviations
Spanish Sample (N = 174)
[Table body illegible in the source.]

The best predictors in this context are Vocabulary and Interest I
(motivation). This is more consonant with previous findings than
was the French prediction, where motivation as measured played
an unusually small role. The Linguistic Analysis Test, which again
holds up well, makes a contribution independent of vocabulary. The
zero-order correlation of each of these tests with the criterion is
about equal (.341 and .383). Substantially all the predictive power
of the whole battery can be obtained through the use of the first
3 or 4 tests.
The regression equation for the seven-test battery is as follows: 


Y' = 11.00 + .46X_1 + 2.40X_2 + .08X_3 + .54X_4 + .03X_5
+ .07X_7 - .78X_8.


Predicting Spanish Auditory Comprehension. The tests which 
predict this criterion are: 


Step Variables added Multiple R 
1 3. Interest II .215 
2 1. Vocabulary .207 
3 4. Linguistic Analysis .286 
4 7. Chinese Pitch .299 
5 8. Sex .305
6 2. Interest I .309
7 5. Reading Aloud .310


Here again, the pattern is one of verbal intelligence plus motiva- 
tion. However, the validity coefficients are lower than for the other 
criteria. 

Discussion. Several points of divergence among the predictions 
may be pointed out. 

1. In this investigation prediction was found to be more accurate
for French than for Spanish. The validity coefficients were .865
and .611 for the two French criteria, versus .552 and .310 for Spanish.
This difference was not unexpected since the predictors had
originally been developed on samples of college French students.

2. Prediction in each language was substantially higher for the 
Cooperative Test than for the Pictorial Auditory Comprehension 
Test. This probably reflected the higher degree of reliability of the 
former measure. 

3. The factors contributing to prediction were different in the
two languages. Spanish prediction was accomplished largely by the
factors of Verbal Intelligence and Motivation, while French prediction
was attained by Verbal Intelligence and by either Word
Fluency (for the reading-writing goal) or Pitch Discrimination (for
the aural goal). The difference in the importance of motivation for
the two languages was apparently not associated with differences
in the two groups, since their means and standard deviations were
almost identical on the interest measures. On the other hand, teachers
did report higher motivation among students taking French than
among those taking Spanish.

Further discussion of these points of difference must await the 
additional evidence of validation studies now under way in which 
larger samples in a different part of the country are being used. 
The essential finding has been the attainment of high predictive 
validity for the French reading-writing goal as well as of reason- 
able predictive validity both for the French aural goal and for the 
Spanish reading-writing goal. 


REFERENCES 


McNemar, Q. Psychological Statistics. New York: John Wiley &
Sons, 1955.

Pimsleur, P., Stockwell, R. P., and Comrey, A. L. “Foreign Lan- 
guage Learning Ability.” Journal of Educational Psychology, 
LIII (1962), 15-26. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


INTERTEST CORRELATIONS OF THE WECHSLER 
INTELLIGENCE SCALE FOR CHILDREN AND TWO 
PICTURE VOCABULARY TESTS¹


GEORGE MOED²
Children's Seashore House 
BYRON W. WIGHT 
University of Pittsburgh 
AND 
PATRICIA JAMES 
Children's Seashore House 


Problem. This study explores the possibility of substituting a
brief picture vocabulary test for the Wechsler Intelligence Scale for
Children (WISC) with physically disabled children. This analysis
of concurrent validity deals with differences in test difficulty, with
correlations between the tests, with the efficiency of predicting WISC
scores from picture vocabulary scores, and with some incidental
retest data.

Subjects. All eligible children in a rehabilitation hospital, from
six to sixteen years of age, were tested in the study. By eligible
we mean that they were capable of taking the WISC. Their disabilities
included such diverse conditions as spina bifida, osteochondritis,
arthritis, and rheumatic fever. Conditions such as these frequently
involve losses of functions which make it difficult, or even
impossible, to administer a test like the WISC. Furthermore, there
is always the possibility of penalizing a severely disabled child for
¹ This study is part of the Research Project at Children's Seashore House,
[text illegible]. The authors wish to thank [names illegible] for their assistance
in the collection [text illegible].

² Now at Fairleigh Dickinson University, Madison, New Jersey.




TABLE 1
Means and Standard Deviations of Ages of Groups

Group      Mean      S.D.      N
Boys      115.61    30.14     44
Girls     123.44    33.96     39
White     116.20    29.05     56
Negro     125.70    37.25     27
Total     119.29    32.04     83


his lack of communicative skill. In contrast to the WISC, the picture
vocabulary tests require a subject to indicate his response by any
means of which he is capable. See Table 1 for maturity characteristics
of the subjects. Age differences between subgroups were
not significant, as determined by t tests.

Procedure. Children were given the WISC (Wechsler, 1949), the 
Ammons Full Range Picture Vocabulary (FRPV), Form A (Am- 
mons & Ammons, 1948), and the Peabody Picture Vocabulary Test 
(PPVT), Form A (Dunn, 1959). Tests were administered individ- 
ually. The sequence of administration of the picture vocabularies 
was counterbalanced. All those still in the hospital were retested 
with the PPVT one year later. 

Results. IQ means and standard deviations of the subgroups are
cited in Table 2. All means and sigmas are based on deviation IQ
scores. Prorated WISC Vocabulary IQ scores are included as a


TABLE 2
Means and Standard Deviations of IQ's Given by Various Tests

              WISC     WISC     WISC     WISC     PPVT     FRPV
              FS       PERF     VERB     VOCAB

Boys
  Mean      100.93   103.36    98.66   101.49    98.64    99.50
  S.D.       19.18    18.84    18.38    22.12    22.16    22.96
Girls
  Mean       92.26    92.00    92.77    92.38    88.54    90.97
  S.D.       13.73    14.76    13.83    20.63    17.92    17.54
White
  Mean      100.86   101.93    99.70   101.80    98.21   102.64
  S.D.       19.11    19.87    17.87    22.87    22.58    19.62
Negro
  Mean       88.56    91.22    88.00    87.56    84.93    80.66
  S.D.        8.08     8.80     9.71    15.71    12.65    15.05
Total
  Mean       96.86    98.45    95.89    97.17    93.89    95.49
  S.D.       17.30    17.74    16.57    21.78    20.79    20.91




matter of interest. The PPVT was more difficult than was the 
WISC Vocabulary, Performance, and Full Scale (p < .05, t test 
between correlated means). Comparisons of performance on the 
FRPV with the other tests revealed no significant differences except 
for the Negro subgroup. The mean for this subgroup was the lowest 
of any in the entire study, in contrast with the mean of the white 
subgroup (on the same test) which was one of the highest. Higher 
means were attained on the Performance Scale of the WISC than 
on the other tests. In general, the spread of differences between 
test means was small, although some children had differences as 
large as 31 points. This indicates the desirability of more than one 
IQ measure in clinical evaluation. 

Table 3 shows the intertest correlations for all the subgroups.
Sixteen out of twenty correlations of the PPVT with the WISC
were higher than the respective correlations of the FRPV with the
WISC; significantly higher in six of the comparisons. The significance
of differences between correlated correlations was tested
using a method described by Edwards (1960, p. 85). The higher
correlations of the PPVT with the WISC are reflected in predictions
of WISC scores from picture vocabulary scores (see Table 4) via
regression equations (McNemar, 1962, p. 123). There is less discrepancy
between PPVT and predicted WISC scores than between
FRPV and predicted WISC scores. The errors of estimate in predicting
from the PPVT and FRPV are .93 and 1.12, respectively.
Although either test can be substituted for the WISC, the PPVT
looks somewhat better for this purpose.

Twenty-nine of the children were still in the hospital one year
later, and they were retested with the PPVT. Their mean IQ
changed only two points with a test-retest correlation of .88.
Against this background, it is probably safe to say that no appreciable
change in intellectual status (as indexed by the PPVT) took
place during one year of hospitalization.

The following material has been deposited as Document number 7470
with the ADI Auxiliary Publications Project, Photoduplication Service, Library
of Congress, Washington 25, D. C.: (a) Values of t from Student's t test of
differences between correlated means, (b) t test between correlated correlations.
A copy may be secured by citing the Document number and by remitting
$1.25 for photoprints, or $1.25 for 35 mm. microfilm. Advance payment
is required. Make checks or money orders payable to: Chief, Photoduplication
Service, Library of Congress.




TABLE 3
Correlation Coefficients

               WISC    WISC    WISC    PPVT    FRPV
               PERF    VERB    VOCAB

WISC FS
  Boys         .93     .94     .84     .85     .74
  Girls        .87     .88     .84     .80     .76
  White        .92     .92     .85     .85     .76
  Negro        .74     .84     .71     .65     .55
  Total        .92     .92     .84     .84     .76
WISC PERF
  Boys                 [?]     .68     [?]     .62
  Girls                .52     .58     .67     .50
  White                [?]     .68     .74     .61
  Negro                .25*    .29*    .50     .26*
  Total                .69     .66     .74     .60
WISC VERB
  Boys                         .89     .86     .76
  Girls                        .88     .72     .82
  White                        .88     .84     .80
  Negro                        .79     .52     .59
  Total                        .88     .82     .79
WISC VOCAB
  Boys                                 .79     .73
  Girls                                .74     .77
  White                                .81     .75
  Negro                                .47     .61
  Total                                .78     .75
PPVT
  Boys                                         .78
  Girls                                        .76
  White                                        .84
  Negro                                        .42
  Total                                        .78

* Only the three correlations indicated were not statistically significant (p > .05).
[?] Entry illegible in the source.


TABLE 4 
WISC Scores Predicted from PPVT and FRPV Scores 
PPVT WISC WISC FRPV 
50 66 68 50 
60 73 75 60 
70 80 81 70 
80 87 87 80 
90 94 93 90 
100 101 100 100 
110 108 106 110 
120 115 112 120 
130 122 119 130 
140 129 125 140 
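
The entries of Table 4 can be checked against the ordinary regression line for predicting one score from another. The sketch below assumes the predictions were made to the WISC Full Scale IQ using the total-sample means and standard deviations of Table 2 and the total-sample correlations of Table 3; the article does not state which figures were used, so this choice is an assumption, and the function and variable names are illustrative only.

```python
# Total-sample statistics as read from Tables 2 and 3 (assumed inputs).
WISC_FS = {"mean": 96.86, "sd": 17.30}
PREDICTORS = {
    "PPVT": {"mean": 93.89, "sd": 20.79, "r": 0.84},
    "FRPV": {"mean": 95.49, "sd": 20.91, "r": 0.76},
}

def predict_wisc(test, score):
    """Predicted WISC Full Scale IQ from a picture vocabulary IQ.

    Uses the least-squares line: slope = r * sd_y / sd_x,
    intercept = mean_y - slope * mean_x.
    """
    p = PREDICTORS[test]
    slope = p["r"] * WISC_FS["sd"] / p["sd"]
    intercept = WISC_FS["mean"] - slope * p["mean"]
    return intercept + slope * score

scores = range(50, 150, 10)
ppvt_pred = [round(predict_wisc("PPVT", s)) for s in scores]
frpv_pred = [round(predict_wisc("FRPV", s)) for s in scores]

for s, a, b in zip(scores, ppvt_pred, frpv_pred):
    print(s, a, b)
```

Under these assumptions the rounded predictions agree with both WISC columns of Table 4, and the regression toward the WISC mean is visible at the extremes: a picture vocabulary IQ of 50 predicts a WISC IQ in the high 60s, and one of 140 predicts a WISC IQ below 130.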




Either the FRPV or the PPVT may be substituted for the WISC
with these physically disabled children. The PPVT was more difficult
than the other tests but showed greater concurrent validity
with the WISC. This finding is valuable for certain research purposes
and for brief clinical screening. Data from a smaller number
of children showed that the PPVT was also dependable in tracking
changes in vocabulary skill over a one-year period. These findings
should be cross-validated on a sample of children who are similarly
hospitalized because of physical disability.


REFERENCES 


Ammons, R. B. and Ammons, H. S. Full Range Picture Vocabulary
Test. Missoula, Montana: Psychological Test Specialists, 1948.

Dunn, L. M. Peabody Picture Vocabulary Test Manual. Minneapolis:
American Guidance Service, 1959.

Edwards, A. L. Experimental Design in Psychological Research
(Revised Edition). New York: Rinehart & Company, 1960.

McNemar, Q. Psychological Statistics (Third Edition). New York:
John Wiley & Sons, 1962.

Wechsler, D. E. Wechsler Intelligence Scale for Children Manual.
New York: The Psychological Corporation, 1949.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


GAINS IN VARIOUS MEASURES OF COMMUNICATION 
SKILLS RELATIVE TO THREE CURRICULAR PATTERNS 
IN COLLEGE 


WILLIAM B. MICHAEL 
University of California, Santa Barbara 
ROBERT CATHCART AND WAYNE S. ZIMMERMAN


Los Angeles State College 
AND 
MILO MILFS 


South Bay State College 


Background. Marks, Cathcart, and Michael (1963) completed
a study in which the relative effectiveness of each of three curricular
plans in communication skills was evaluated for several samples
of freshman students from the 1959-60 class at the Los Angeles
State College. With respect to four different criterion measures
that were intended to represent achievement in reading, writing,
critical thinking, and listening, the net amounts of change in average
score between September 1959 and May 1960 (“net” in the
sense of correcting for differences in initial verbal aptitude of each
sample) were calculated and compared relative to which one of
three types of curricular offerings (or sequences) groups of men
and women students chose. Although statistically reliable increments
in mean scores on the tests were realized for nearly all samples
of students, it was not possible to determine either to what
extent the changes were due to practice effects in testing, to incidental
learning, to increased maturity, or to the nature of the instructional
process, or to what degree one curricular approach was
superior to another. The teacher variable could be only partly
controlled in that not only did different teachers teach in the same
curricular sequence, but also several participated in more than
one of the three curricular sequences. Moreover, from a practical
standpoint the size of the average gains was modest: customarily
one-fourth to two-thirds of the standard deviation of scores on
either the first or second test taken. In other words, near the center
of the range of the score distributions on the four tests employed,
the average gain would amount to a change in percentile standing
of between 10 and 20 points. In addition, the validity of the verbal
aptitude test in the prediction of gains was low largely because of
the existence of attenuated reliabilities in difference scores. Sex
differences were insignificant. The most important finding was
that when, through an analysis-of-covariance approach, allowances
were made for any differential in average ability levels of the
different groups there were, with one exception, no significant differences
in the amount of net mean gain in each of the achievement
measures associated with each of the groups following any one of
the three curricular sequences.
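
The point about attenuated reliabilities in difference scores can be illustrated with the classical formula for the reliability of a gain score when the two testings have equal variances. The sketch below is an illustration only: the reliabilities and the September-May correlations are hypothetical values, not figures from the study.

```python
def gain_score_reliability(rel_pre, rel_post, r_pre_post):
    """Reliability of a difference (gain) score, assuming equal variances.

    rel_pre, rel_post: reliabilities of the first and second testing
    r_pre_post:        correlation between the two testings
    """
    return ((rel_pre + rel_post) / 2 - r_pre_post) / (1 - r_pre_post)

# Hypothetical values: both testings have reliability .85.
moderate_overlap = gain_score_reliability(0.85, 0.85, 0.65)
high_overlap = gain_score_reliability(0.85, 0.85, 0.80)

print(round(moderate_overlap, 2), round(high_overlap, 2))
```

The more highly the September and May scores correlate, the less reliable their difference becomes, and an unreliable criterion limits the validity any predictor can show against it.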

It was decided that a second investigation should be undertaken 
with the 1960-61 freshman class to ascertain whether the findings
would be comparable with those for the 1959-60 group. In addition to
replication of the same kind of data found for the 1959-60 class, 
new information was sought in terms of intercorrelations of the 
various tests used for each of the samples of a given sex and of a 
particular curricular identification, as well as for the combined 
samples of all male and of all female students (irrespective of cur- 
ricular pattern followed). However, there was no controlled ap- 
proach open to the writers for assessing in a definitive way the 
degree to which gains in achievement could be directly attributed 
to a particular instructional plan. The same three curricular alter- 
natives were open to the 1960-61 students as for the earlier class: 

ES: English Composition (in fall) followed by Speech (in spring)

SE: Speech (in fall) followed by English Composition (in spring)

LA: Language Arts, an integrated two-semester course


Purpose. It was the purpose of the second investigation involving
members of the 1960-61 freshman class (1) to ascertain the extent
to which the various criterion measures of achievement in communication
skills given in September and again in May were interrelated
with the view of possibly dropping certain ones or adding
others, (2) to assess for each sex group the amount of change in 
mean scores on each of the four test measures relative to type of 
curriculum pursued, (3) to determine the degree of validity of the 
verbal aptitude measure relative to (a) final standing in May on 
each of the four criterion measures and (b) gain in score on each 
of these same four measures between September and May, (4) to 
find the composite validity (in terms of a multiple correlation 
coefficient) between optimally-weighted scores on the verbal apti- 
tude measure and on each of the four tests in communication skills 
given in September (serving as independent, or predictor, variables) 
and each one of the same test measures administered in May 
(serving as the dependent, predicted, or criterion variables), and (5) 
to note any sex differences of possible importance. 

Subjects and Measures. The numbers of men and of women, for 
whom complete data were available relative to the English-Speech 
sequence, the Speech-English order, and the two semester Language- 
Arts course were 37 and 69, 39 and 36, and 47 and 64, respectively— 
with total numbers of men and women for all curricular patterns 
being 123 and 169. 

It is not known whether any systematic bias may have been present
in the selection, or formation, of the samples. In any event
there is no systematic control over teacher differences, as existing
(i.e., intact) classes taught by different professors were studied.

As in the previous year the same four standardized achievement
tests were administered: Reading, Writing, and Listening in the
college-level forms of the Sequential Tests of Educational Progress
(STEP) and one test in Critical Thinking from the American Council
on Education. However, in the instance of the measure of verbal
aptitude, scores on the School and College Ability Test of the
Educational Testing Service were not available. In its place the
English score of the ACT (American College Test), which approximates
a measure of verbal aptitude, was employed. (In about seven
cases scores on the College Qualifying Test (CQT) were used after
being equated as nearly as possible to those of the ACT.)

The reliabilities of all tests were judged to be sufficiently high
(stable and consistent) for purposes of group research, although in
the instance of individual prediction some reservations could be
held. Whenever different forms of the achievement tests were utilized,
slight adjustments in raw scores were effected through use of
conversion charts in the test manuals in order that comparability of
[...]

TABLE 1
Intercorrelations of the Achievement Tests and a Verbal Scholastic Aptitude Test
Along with Means and Standard Deviations for the Total Sample of 123 Male
Freshmen (Entries Above Diagonal) and for the Total Sample of 169 Female
Freshmen (Entries Below Diagonal)
(All Decimal Points in Correlations Omitted)
[Table body illegible in the source.]

TABLE 2
Gross Gains and Adjusted Gains in Mean Scores of Groups of Male and Female
Freshman Students Who Pursued Three Different Curricular Patterns in
Communication Skills
[Table body illegible in the source.]


[...] by sex and curricular plan, as well as for the total samples of men
and of women irrespective of curricular plan followed, means and
standard deviations were calculated, and intercorrelations were
found between scores on the eight achievement tests (four sets of two
tests, one set administered in September and the other in May) and
the scores on the verbal aptitude measure. Although the gains (adjusted
and unadjusted) in mean scores for each sample are reported
in Table 2, intercorrelations of test variables are presented in Table
1 only for the total samples of men and women, since the similarities
in correlation patterns of the matrices for the six samples seemed
to justify their omission and a consequent substantial saving in
space. In Table 3 multiple correlation coefficients are reported between
scores on each of the achievement tests on the second (May)

TABLE 3
Multiple Correlation Coefficients Between Each of Several Dependent Variables
(Tests) and Optimally-Weighted Combinations of Selected Independent
Variables (Tests) for Each of the Samples of Students Who Followed
a Particular Curricular Sequence in Communication Courses

(All Decimal Points Omitted)

                            Multiple Correlation Coefficients*
Group          Number (N)   [column subscripts illegible in the source]
Males ES           37        47    39    65    75    53
Males SE           39        56    51    72    72    84
Females SE         36        70    78    78    83    78
Males LA           47        64    58    80    70    71
Females LA         64        64    68    78    77    71
All Males         123        56    55    69    73    71
All Females       169        63    70    79    74    75

* The single number inserted before the dot represents the dependent variable (one [...]);
the numbers following the dot designate the independent variables (each [...]
correlation with the dependent variable). The variables are described as follows:

(1) Writing (September)             (5) Listening (September)
(2) Writing (May)                   (6) Listening (May)
(3) Reading (September)             (7) Critical Thinking (September)
(4) Reading (May)                   (8) Critical Thinking (May)
(9) Verbal Aptitude (September)


administration and the optimally-weighted combinations of other
test variables (including the one of verbal aptitude) given in September.
Moreover, for passing interest, the multiple correlation was
computed between scores on the test of verbal aptitude as a dependent
variable and a combination of all other achievement tests
also administered in the fall.

In view of the fact that the differences between means were rela- 
tively small and comparable in magnitude to those of the previous 
investigation with the 1959-60 class for which use of the laborious 
analysis of covariance model added little refinement, it was decided 
not to employ the intricate and computationally complex covari- 
ance technique. Nevertheless, an approximate adjustment for dif- 
ferences in the initial intellectual level of the samples was carried 
out that in effect would be expected to give almost identical results 
to those furnished by the covariance model in the obtaining of 
estimates of the net amount of change, or gain, in test scores when 
the average aptitude levels of all samples of a given sex are equated. 
(The adjustments were based on use of a single line for all groups
(of the same sex) to describe the regression of gain scores upon
those of verbal aptitude instead of on use of a pooled line derived
essentially from an average of the slopes of the regression lines
within each of the samples. In the terminology of Walker and Lev
(1953, p. 395), the b_t instead of the b_w line was employed. For the
1959-60 data the differences between the two estimates of adjusted,
or net, mean gains were trivial indeed.) These “net” gains, which
differ only slightly from the gross gains, are reported in Table 2.
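
The adjustment described above can be sketched as follows. This is an assumption about the arithmetic, not a transcription of the authors' worksheet: each group's mean gain is corrected by the slope of a single regression line fitted to all groups together, multiplied by that group's departure from the grand mean on aptitude. The data and group sizes are hypothetical.

```python
def total_slope(aptitude, gains):
    """Slope of one regression line of gain on aptitude, all groups combined."""
    n = len(aptitude)
    mean_x = sum(aptitude) / n
    mean_y = sum(gains) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(aptitude, gains))
    sxx = sum((x - mean_x) ** 2 for x in aptitude)
    return sxy / sxx

def adjusted_mean_gains(groups):
    """groups maps a sequence name to a list of (aptitude, gain) pairs."""
    pairs = [p for members in groups.values() for p in members]
    aptitude = [a for a, _ in pairs]
    gains = [g for _, g in pairs]
    slope = total_slope(aptitude, gains)
    grand_mean_aptitude = sum(aptitude) / len(aptitude)
    adjusted = {}
    for name, members in groups.items():
        mean_aptitude = sum(a for a, _ in members) / len(members)
        mean_gain = sum(g for _, g in members) / len(members)
        # Remove the part of the gain predicted by the group's aptitude level.
        adjusted[name] = mean_gain - slope * (mean_aptitude - grand_mean_aptitude)
    return adjusted

# Hypothetical (aptitude, gain) pairs for the three curricular sequences.
groups = {
    "ES": [(18, 2), (20, 3), (22, 4)],
    "SE": [(20, 1), (22, 3), (24, 2)],
    "LA": [(19, 3), (21, 2), (23, 4)],
}
adjusted = adjusted_mean_gains(groups)
print({name: round(value, 2) for name, value in adjusted.items()})
```

When the groups differ little in mean aptitude, or when gains are nearly unrelated to aptitude, the correction term is close to zero, which is consistent with adjusted gains that differ only slightly from gross gains.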

Findings. With respect to the data presented in Tables 1 through 3 
the following findings may be summarized: 


(1) In general, the magnitudes of the coefficients of correlation 
between all pairs of test variables are somewhat higher for 
women than for men. 

(2) Within each sex group there are virtually no appreciable 
differences in the average correlation of achievement tests 
with one another irrespective of whether the correlations 
were obtained for the scores received during September, or 
during May, or were derived from scores on pairs of tests, 
one of which was administered in September and the other 
in May. 


(3) The range of the magnitudes of the correlation coefficients
is relatively narrow, a finding which indicates that each
test correlates roughly to about the same degree with every
other test.

(4) Although one might expect that its intended uniqueness
would serve to yield low correlations, the test of Listening
for both sexes is more highly correlated with that of Reading
than are the tests in Reading and Writing with each
other. Apparently the linguistic functions measured by the
Listening test are not too unlike those found in the other
achievement measures that tap receptive aspects of language
skills.

(5) In view of the comparability of most of the correlation
coefficients between the various sets of test scores, there is
also some doubt as to what the test in Critical Thinking
contributes over and above any of the other measures.

(6) Within the total samples of men and women there are no
statistically significant sex differences in mean performance
on any of the test measures.

(7) Although showing substantial predictive validity of standing
on each of the four achievement measures (either in
September or in May), the verbal aptitude test yields for
the six subgroups studied validities between -.16 and +.28
with gain scores on each of the achievement measures. As a
matter of fact for the total samples of men and of women
students the validity coefficients show ranges, respectively,
of between -.04 and +.02 and of between -.05 and +.04.
The lack of reliability in gain scores probably accounts to a
large extent for the near zero validities. (These validity
coefficients are not reported in any of the Tables.)

(8) Although most of the gains in average test performance
between September and May are modest, most of them are
statistically significant (see Table 2).

(9) The adjusted gains differ only slightly, and from a practical
viewpoint inconsequentially, from the gross gains. Thus
differences in initial intellectual level of the participating
groups can be viewed as of trivial importance.

(10) Although for each sex the differences in mean adjusted gains
on each one of the four achievement measures relative to
each of the three curricular plans were not tested for significance,
it could be seen for the most part that the magnitudes
of the differences (i.e., the differences between mean
differences) were usually between zero and about three
points, less than or at most equal to one standard error of
measurement associated with an individual score. (These
data are not reported.) In light of experience gained in the
investigation by Marks, Cathcart, and Michael (1963), it
would seem that only in the instance of the test of Critical
Thinking administered to men might there be any significant
differential effects associated with curricular sequence. In
short, from a practical standpoint the differences in amounts
of mean gain are of relatively little importance.

(11) From the relatively high coefficients of multiple correlation 
and from the relatively narrow range of values, it is appar- 
ent in Table 3 that each of the achievement tests is closely 
associated with a weighted combination of the other achieve- 
ment tests (the women showing somewhat higher degrees of 
association than the men). 


Conclusions and Recommendations. From the results presented, 
the following conclusions and recommendations can be formulated: 


(1) Although in most instances statistically significant gains in
mean level of performance appeared on each of the four achievement
tests irrespective of very slight and inconsequential differences
in average ability level of the samples, the modest amounts of mean
increments and the lack of any differential gains being associated
with any particular type of curricular plan suggest that it makes
no practical difference as to which one of the three curricular
offerings a student elects. (It may well be that with different evaluation
criteria one curricular procedure might be superior to another.
However, a control group would be necessary to assess the extent to
which gains might be attributed to practice effects and to increased
maturity of the participants over an eight-month period of college
attendance.)

(2) For each of the samples differentiated by sex and type of
curriculum pursued, it is apparent from the relatively high intercorrelations
that the various achievement tests in Writing, Reading,
Listening, and Critical Thinking, as well as the English section of
the American College Test (ACT) employed as a test of verbal aptitude,
are measuring many of the same characteristics and thus
duplicating one another. In view of the substantial degree of overlap
among all five measures, consideration might well be given to
dropping one or more of the tests in future activities of evaluation.


REFERENCES 


Marks, Alvin, Cathcart, Robert, and Michael, William B. “The 
Prediction of Gains in Mean Performance in Various Measures 
of Communication Skills Relative to Type of Curriculum Pur- 
sued.” Journal of Experimental Education, in press. 

Walker, Helen M. and Lev, Joseph. Statistical Inference. New York: 
Henry Holt and Company, 1953. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT


STABILITY OF PREDICTIVE VALIDITIES OF HIGH 
SCHOOL GRADES AND OF SCORES ON THE 
SCHOLASTIC APTITUDE TEST OF THE 
COLLEGE ENTRANCE EXAMINATION 
BOARD FOR LIBERAL ARTS 
STUDENTS¹


WILLIAM B. MICHAEL 
University of California, Santa Barbara 
AND 
ROBERT A. JONES 


University of Southern California 


Problem. For five different sets of samples of freshman males and
freshman females who entered the College of Letters, Arts, and
Science of the University of Southern California during the years
1956, 1957, 1958, 1960, and 1961 it was the purpose of the investigation
to ascertain from class to class the absolute magnitudes of the
predictive validity coefficients of the part scores and the total scores
of the Scholastic Aptitude Test (SAT) of the College Entrance
Examination Board (CEEB) and of the grade-point averages in
academic courses in high school as well as the relative degree of
[...]ity of these same predictors when used in combinations.

Statistical Procedures. Traditional correlational and multiple-regression
techniques were employed.


¹ Based in part on a paper presented to the annual meetings of the California
Educational Research Association, Monterey, California, March 10,
[year illegible].

For assistance rendered at various times in the collection and analysis of
data that resulted in the correlational information to be presented, grateful
acknowledgement is made to David Baker, Anna Cox, Arthur Gershon, Mary
[...]nan, Marvin Hoover, Kenneth Katz, Linda Robinson, Dennis Smith, W. A.
[...], and Altha Williams.


Findings. Consistent with the fact that there has been a reduc- 
tion between 1957 and 1961 of approximately one-fifth in the (un- 
reported) magnitudes of the standard deviations in each of the 
predictor variables considered, there has been a trend (although 
not monotonic) for a decline in the magnitudes of the validity 
coefficients of each of these same predictors. This reduction in 
variance of each predictor has been associated with a corresponding 
unreported rise in its mean, approximately equal to three-fifths of 
its standard deviation in 1956 and in 1957. Similar findings would 
be expected in many colleges with highly selective admissions stand- 
ards that have become even more stringent during the past four or 
five years in view of the competitive pressures arising from an ever-
increasing population of college applicants. In particular, the fol-
lowing findings may be summarized from inspection of Tables 1
and 2:


TABLE 1

Predictive Validity Coefficients of High School Record (Last Three Years), CEEB
Total Score, CEEB Verbal Score, and CEEB Mathematics Score Relative
to a Criterion of Grade-Point Average During the First Semester
(1957 and 1958 Classes) and During First Two Semesters
(1956, 1960, and 1961 Classes) of Work in the Liberal
Arts College of the University of Southern California
(All Decimal Points Omitted)

                                1956       1957       1958       1960       1961
Sex     Predictor               Freshmen   Freshmen   Freshmen   Freshmen   Freshmen

Men                             N = 186    N = 310    N = 209    N = 209    N = 228
        High School Record      58         40         44         40         35
        CEEB Total Score        37 29 2 25
        CEEB Verbal Score       45 27
        CEEB Math Score         46         22         24         23         16

Women                           N = 234    N = 272    N = 177    N = 233    N = 288
        High School Record      47         51         52         52         48
        CEEB Total Score        50         48         48         36         28
        CEEB Verbal Score       53         45         41         41         29
        CEEB Math Score         34         39         39         20         16

Note. For men, the CEEB Total Score and CEEB Verbal Score rows are incomplete
in the source; the surviving entries are listed in their original order and are
not aligned to the year columns.


TABLE 2

Coefficients of Multiple Correlation Between the Criterion of Grade-Point Average
Earned During the First Semester (1957 and 1958 Classes) or During the
First Two Semesters (1956, 1960, and 1961 Classes) of College Work in
Liberal Arts and Selected Combinations of Predictor Variables
(All Decimal Points Omitted)

        Combinations of                     1956       1957       1958       1960       1961
Sex     Predictors                          Class      Class      Class      Class      Class

Men                                         N = 186    N = 310    N = 209    N = 209    N = 228
        (a) CEEB Verbal and CEEB Math       46         31         33         28         27
        (b) High School Record and
            CEEB Total Score                55         47         47         44         39
        (c) High School Record, CEEB
            Verbal, and CEEB Math Score     60         47         47         44         42

Women                                       N = 234    N = 272    N = 177    N = 233    N = 288
        (a) CEEB Verbal and CEEB Math       53         47         48         41         29
        (b) High School Record and
            CEEB Total Score                54         56         56         56         50
        (c) High School Record, CEEB
            Verbal, and CEEB Math Score     60         59         59         61         53


(1) Consistently, the record of academic achievement in high
school has been more predictive of success in college work for both
men and women students than have been either part or total scores
on the SAT of the CEEB.

(2) For both sexes the verbal score on the SAT has been about 
as predictive of college success of first-year students in liberal arts 
as has been the total score. 

(3) A combination of high school record and scores on the SAT 
(either the total score or a weighted combination of the two part 
scores) has yielded higher validities than has use of individual
predictors. 

(4) Although beta weights are not cited, the high school record 
has received consistently a weight approximately twice that asso- 
ciated with the total score on the CEEB. 




(5) With two exceptions in 1956 and one exception in 1960, the 
validity coefficients of each predictor have been higher for women 
than for men. 

Conclusion. With progressive restrictions in range of measures in 
various selective devices, reduction in the variance in predictors 
such as high school grade-point averages and scores on tests of 
scholastic aptitude may be expected in colleges that require from
year to year a higher minimum standing in each of these predictors.
Correspondingly a decrease in the size of predictive validity coeffi- 
cients may be anticipated, although the relative importance of each 
predictor may remain essentially invariant. 
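
The link between a narrower predictor range and a smaller validity coefficient can be illustrated with the standard formula for direct restriction of range on a single predictor. The sketch below is not the authors' computation. It assumes explicit selection on the one predictor, a linear relationship, and equal error variance across the range, and the function name is hypothetical. It takes the 1957 high school record coefficient for men (.40) and the one-fifth reduction in standard deviation mentioned in the Findings.

```python
import math


def restricted_r(r_unrestricted: float, sd_ratio: float) -> float:
    """Expected correlation after direct selection on the predictor.

    sd_ratio is the restricted standard deviation divided by the
    unrestricted standard deviation of the predictor (assumed inputs).
    """
    r = r_unrestricted
    u = sd_ratio
    return (r * u) / math.sqrt(1.0 - r * r + (r * u) ** 2)


# 1957 validity of the high school record for men (.40), with the
# predictor's standard deviation reduced by one-fifth (ratio 0.8).
print(round(restricted_r(0.40, 0.8), 2))  # 0.33
```

Under these assumptions the expected coefficient is about .33, which is of the same order as the .35 reported for the 1961 class of men.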


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


GRE APTITUDE SCORES AS PREDICTORS OF GPA 
FOR GRADUATE STUDENTS IN EDUCATION 


WALTER R. BORG 
Utah State University 


Purpose. The purpose of this study was to determine the predictive
validity of the scores in verbal ability and quantitative ability of
the Aptitude Test of the Graduate Record Examination relative to
the criterion of grade point average (GPA) of graduate students in
the Department of Education at Utah State University. (The
verbal and quantitative portions will be designated GRE-V and
GRE-Q, respectively.) Some previous research has been concerned
with this problem. Using graduate school grade point average in
education courses, Conway (1955) found correlations of .27 with
GRE-V and .23 with GRE-Q. His sample of 36, however, included
only students receiving the master's degree, thus restricting the
ability range. Capps and Decosta (1957) reported a correlation of
.34 between grades in four basic professional courses and the GRE
Aptitude Test for a sample of 41 graduate students in education.
Sample. All graduate students in the Department of Education 
at Utah State University who had taken the GRE during the pre- 
ceding five years and who had completed at least fifteen quarter 
hours of work subsequent to the bachelor’s degree were included in 
the study. The fifteen hour minimum was established to increase 
the reliability of the criterion although it was realized that this 
would probably lead to some restriction in range. Students were
nearly all master's degree candidates in elementary education, in
secondary education, or in educational administration. Students in 
areas such as music education, industrial arts education, and physi- 
cal education were not included. 




TABLE 1
Means, Standard Deviations and Validity Coefficients
(N = 175)

                                    Standard     Validity
Variable                  Mean      Deviation    Coefficient
GRE Verbal                424.17    96.14        .36
GRE Quantitative          447.48    95.24        .37
Grade Point Average       3.34*     .283

* Grade point average is based on a scale in which the number of points is as follows: A = 4,
B = 3, C = 2, D = 1, F = 0.


Results. Table 1 gives means, standard deviations, and predictive 
validities for the sample of 175 cases meeting the requirements of 
the study. The low predictive validity of both GRE Aptitude Scores 
seems to indicate that this measure, used alone, is of little value as 
a predictor of success in the graduate program in education at Utah 
State University. The correlation between GRE-V and GRE-Q for 
the sample was .60. 
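
Since the intercorrelation of the two scores is given, the validity of the two scores used together can be sketched with the standard two-predictor multiple correlation formula. This calculation is not reported in the study. The validity coefficients are taken as .36 and .37, where the second is an assumed reading of a Table 1 entry that is hard to read in the source, and the function name is hypothetical.

```python
import math


def multiple_r_two_predictors(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r1, r2: validity coefficients of the two predictors.
    r12:    correlation between the two predictors.
    """
    r_squared = (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 ** 2)
    return math.sqrt(r_squared)


# GRE-V = .36, GRE-Q = .37 (assumed reading), intercorrelation = .60.
print(round(multiple_r_two_predictors(0.36, 0.37, 0.60), 2))  # 0.41
```

Under these assumed values the combination would raise the validity only from about .37 to about .41, because the two scores overlap considerably.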

Of the 175 cases used in this research, 29 had GPAs below 3.00
(B average), the level required for graduate students in satisfactory
standing at Utah State University. The mean GRE-V for the group 
below GPA 3.00 was 354.83, with a standard deviation of 60.69. 
Their mean GRE-Q was 398.28, with a standard deviation of 78.72. 

A total of 29 students who had taken from 3 to 14 quarter hours 
of work were also identified and their mean scores and standard 
deviations on the pertinent variables were calculated. This group 
had a mean of 411.38 on the GRE-V with a standard deviation of 
85.57 and a mean of 423.45 on the GRE-Q with a standard deviation 
of 95.69. Mean GPA of this group was 3.28 with a standard deviation 
of .55. Although elimination of these cases from the study increased 
GRE means, it appeared to have had little effect on GRE varia- 
bility. 

Discussion. The data in Table 1 suggest that the GRE has little 
predictive validity for a relatively unrestricted sample of graduate 
students in education. It will be noted that the standard deviation 
of this group is very close to the S.D. of 100 established when the 
scale scores were developed by the Educational Testing Service 
(1956, p. 8). It is surprising that the GRE-V had no greater pre-
dictive validity than the GRE-Q, as it is usually assumed that the
content of most courses taken by graduate students in education
requires the types of aptitude sampled by the verbal measure. Per-
haps the predictive validity of the GRE-Q is raised by the fact that
the few courses requiring quantitative aptitude (such as research
methods, educational measurement, and statistics) are often the
most variable in terms of grades given graduate students. Since
grades in many graduate courses are limited almost entirely to A
or B, the discrimination value of the GPA is reduced.

In order to determine whether the GRE discriminates between
successful (3.00 GPA or over) and unsuccessful (under 3.00 GPA)
students, the percentage of students in the total sample earning
GRE scores within each interval (formed by steps of one-half
standard deviation) was computed separately for those who were
successful and those who were unsuccessful. Table 2 gives the number
and percentage (relative to the total sample) of successful and
unsuccessful graduate students earning GRE-V and GRE-Q scores
within each of six intervals. It may be seen that the verbal score
discriminates somewhat more effectively than the quantitative score
at the lower levels. If a GRE verbal cutoff had been established at
one-half standard deviation below the mean, 72 per cent of the
unsuccessful students and 27 per cent of the successful students
would have been eliminated. In terms of the total number of subjects,
however, this cutoff would have eliminated 21 unsuccessful and 41
successful students. The value of such a cutoff seems highly doubtful
in a situation similar to that existing in the Utah State University
College of Education.
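
The cutoff argument can be checked from the counts quoted above (21 of 29 unsuccessful and 41 of 146 successful students below the cutoff). The short sketch below uses a hypothetical helper, not anything from the study. It also computes a figure the text implies but does not state: the share of all eliminated students who were in fact successful.

```python
def cutoff_summary(unsuccessful_cut, unsuccessful_total,
                   successful_cut, successful_total):
    """Summarize whom a score cutoff would have eliminated.

    Returns the percentage of each group falling below the cutoff and
    the percentage of all eliminated students who were successful.
    """
    pct_unsuccessful = 100.0 * unsuccessful_cut / unsuccessful_total
    pct_successful = 100.0 * successful_cut / successful_total
    eliminated = unsuccessful_cut + successful_cut
    pct_eliminated_successful = 100.0 * successful_cut / eliminated
    return pct_unsuccessful, pct_successful, pct_eliminated_successful


# Counts quoted in the text for a GRE-V cutoff one-half S.D. below the mean.
rates = cutoff_summary(21, 29, 41, 146)
print([round(x, 1) for x in rates])  # [72.4, 28.1, 66.1]
```

Direct division gives about 28 per cent of successful students eliminated, close to the 27 per cent quoted. About two of every three students removed by the cutoff would have been successful, which is the basis for doubting its value.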


TABLE 2
Numbers (N) and Approximate Integral Percentages (%) of Successful and
Unsuccessful Students at Different GRE Levels

                                     GRE Verbal                  GRE Quantitative
Level on GRE in Terms of       Unsuccessful   Successful     Unsuccessful   Successful
Standard Deviation (S.D.)        N     %       N     %         N     %       N     %

Above +1 S.D.                    ..    ..      ..    ..        ..    ..      ..    ..
+1/2 S.D. to +1 S.D.             ..    ..      ..    ..        ..    ..      ..    ..
Mean to +1/2 S.D.                ..    ..      ..    ..        ..    ..      ..    ..
-1/2 S.D. to Mean                 4    14      26    18         6    20      39    27
-1 S.D. to -1/2 S.D.             15    51      27    18         8    28      13     9
Below -1 S.D.                     6    ..      ..    ..         8    28      ..    18

Totals                           29   100     146   100        29   100     146   100

Note. Entries shown as .. are illegible in the source.




REFERENCES 


Conway, Sister Madona Therese. “The Relationship of the 
Graduate Record Examination Results to Achievement in the 
Graduate School at the University of Detroit.” Unpublished 
Master’s thesis, University of Detroit, 1955. 

Capps, Marian P., and Decosta, Frank A. “Contributions of the 
Graduate Record Examinations and the National Teacher Exami- 
nations to the Prediction of Graduate School Success.” Journal of 
Educational Research, L (1957), 383-389.

Educational Testing Service. Summary Statistics 1956-57 Gradu- 
ate Record Examinations. Princeton, New Jersey: Educational 
Testing Service, 1956. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
Vol. XXIII, No. 2, 1963 


PERSONALITY CORRELATES OF SUCCESS 
IN STUDENT-TEACHING 


GLENN W. DURFLINGER 
University of California, Santa Barbara 


AN evaluation of the literature upon the related subjects of 
teacher competence and of the prediction of teaching effectiveness 
reveals that research in these areas has been going on since well 
before the beginning of the twentieth century. Each decade brings
forth new data, new techniques for solving the problems, and addi- 
tional light upon the complexities of this immense problem. Today 
there is probably no other area of research in education that is 
drawing more attention from educators and from lay persons than 
that of teacher competence. Why? Three reasons central to this 
interest would seem to be associated with a desire: (a) to develop 
the most effective techniques for the selection and for guidance of 
young people who are considering careers in teaching; (b) to estab- 
lish the most helpful university program for the preparation of our 
teachers; and (c) to be able to reward with tenure, promotion, or 
increased salary the persons in the teaching profession who are most 
deserving. 

Pertaining to the first of these reasons is the problem of this 
study—the determination of the personality correlates of success in 
student-teaching.

Purposes of this Study. The two major purposes of this study
were: (1) to assess the validity of certain instruments designed to
measure the personality of college students as predictors of effective-
ness in student-teaching, and (2) to determine which personality
traits are most highly correlated with specific characteristics of
teaching effectiveness.




Procedure. The procedure employed in this study was first to 
select several criteria of teaching effectiveness, each of which was 
sufficiently reliable to enable it to serve as a sounding board for 
checking student-teaching personality evaluations. The following 
criteria were employed: (1) a teacher rating scale; (2) grades in 
student-teaching; (3) grades in procedures (methods) courses; 
(4) total grade or honor point averages; (5) parts and combinations 
of these four. 

Against these criteria were correlated measures on the following 
test and subtest predictors for 150 college students: (1) the Cali- 
fornia Psychological Inventory (CPI), (2) the Heston Personal 
Adjustment Inventory (HPAI), (3) the Minnesota Teacher Atti- 
tude Inventory (MTAI), (4) the Elementary Teacher and the 
Femininity-Masculinity Scales of the Strong Vocational Interest
Blank for Women, (5) the American Council on Education Psycho- 
logical Examination (ACE), (6) How I Teach by Kelley and 
Perkins—an analysis of teaching practices (HIT), and (7) the 
Personal-Social Scale of the Roeder Occupational Aptitude Test 
(PSOA). In addition to validation of these predictors, it was hoped 
that those personality attributes of an elementary school teacher 
could be described which most effectively contribute to certain 
purposes of instruction and to teaching effectiveness in the class- 
room. In the interest of completeness, intercorrelations both among 
the predictors and among the criterion variables were obtained.

The Criteria. The teacher rating scale is an unpublished, single-
sheet, teacher-evaluation device consisting of forty-one items within 
seven subdivisions. It was developed at the Santa Barbara campus 
of the University of California to satisfy the need for a short 
teacher-rating scale with at least a reasonably high reliability to be 
used with student-teachers. Its coefficient of reliability has been 
found to be .91 for two applications of the instrument by the same 
supervisor and .81 when employed by two different supervisors. 
The reliability of the parts runs from .64 to .80 when the scale is
used by the same person. There were two separate ratings for each 
student-teacher—one by his supervisor during his first semester of 
student-teaching and one by a different supervisor during his second 
semester of student-teaching. 

The student-teaching grades were based upon eight semester
hours of student-teaching divided equally between two semesters of
classroom teaching under the supervision and with the assistance of
an expert teacher, the regular classroom teacher.

In the semester immediately preceding the year of student-
teaching, the student enrolled in eight semester hours of teaching
procedures or teaching methods divided into two classes of three

The total grade or honor point average, commonly called GPA,
is the quotient of the honor points divided by the units taken. An
"A" grade gives four honor points; "B" gives three, etc.
Instruments to be Validated. The California Psychological
Inventory (CPI), which was developed as an instrument for
assessing characteristics of broad personal and social significance,
was validated on the favorable and positive aspects of personality
rather than on pathological characteristics. It has four class
subdivisions and eighteen subtests.

Test-retest reliability coefficients on 226 high school students
with a lapse of 12 months between administrations ranged from .57
to .77 for 16 of the 18 scales. Only Py (Psychological-mindedness)
and Cm (Communality) yielded lower coefficients.

The Heston Personal Adjustment Inventory (HPAI) has been
used in similar studies designed to predict success in teaching. It has
subtests which purport to measure the following six aspects of
adjustment:


Analytical Thinking: e.g., intellectual independence;
Sociability: presence of extroverted or introverted tendencies;
Emotional Stability: tolerance of frustration;
Confidence: a lack of feelings of inferiority associated with decision-
    making capability;
Personal Relations: absence of a tendency to be annoyed or irritated
    by others; presence of disposition toward being fair and im-
    personal; and
Home Satisfaction: existence of pleasant family relations; recog-
    nition of one's obligation to home and family.


The reliability coefficients reported in the manual of this instru-
ment for each subtest are as follows (.. = illegible in the source):


  A      S      E      C      P      H
 .86     ..    .86    .84    .80    .87


The Minnesota Teacher Attitude Inventory (MTAI), according to
its authors, was designed to measure attitudes toward children and
toward selected aspects of teaching. The inventory is composed of
150 items purporting to measure attitudes through the following
five aspects of child development and education: moral status of
children, i.e., children's adherence to adult-imposed standards; disci-
pline and problems of conduct; principles of child development and
behavior; principles of education related to philosophy, curriculum,
and administration; and personal reactions of the teacher. The
authors have reported a reliability coefficient of .93 for the MTAI.

From the Strong Vocational Interest Blank only the Elementary
Teacher interest scale and the Feminine-Masculine interest scale 
were employed. In the latter scale the results for the few men in the 
sample were withdrawn. 

The American Council on Education Psychological Examination
(ACE) has been administered for many years on an institution-
wide basis at the Santa Barbara campus. The test consists of two
major divisions, the quantitative and the linguistic, from which a
total score may be derived. The two divisions, as well as the total
scale, were utilized in this study.

It was thought that the investigation should include some measure 
of teaching competence as tested by an instrument designed to 
measure knowledge of teaching practices and professional back- 
ground. The Kelley-Perkins How I Teach test (HIT) was selected 
to cover this area.

The final test was the Personal-Social scale of the Roeder Occu-
pational Aptitude Test (PSOA). It was thought that such a test
would detect an aptitude in the personal-social area in which teach-
ing, it would seem, is logically classified.

In this group of personality tests there is obviously no measure
of subject-matter competence other than grade-point average. Ac-
cordingly, a subject proficiency test, the Iowa Tests of Educational
Development (ITED), was included. This is intended to be a
measure of understanding of social science concepts, of natural
science concepts, of the ability to interpret reading material in
several subject areas, and of knowledge of the fundamentals in
English and mathematics: such proficiency as the student would
normally acquire in high school. The first seven tests of this series
were used; the tests of general vocabulary and use of sources of
information were omitted. The names of the tests used are found in
Table 1.




TABLE 1
Correlation Coefficients Between Thirty-Nine Predictors and Each of Two Criteria
(All decimal points omitted)

                                                       Criteria
                                                Rating       Student-Teaching
Variables                                       Scale (1)    Grades (2)

I. California Psychological Inventory
     Dominance                                    00           00
     Capacity for Status                         -04           04
     Sociability                                  08           26
     Social Presence                             -01           18
     Self-Acceptance                              04          -24
     Sense of Well-Being                         -07           18
     Responsibility                               04           19
     Socialization                                12           26
     Self-Control                                -13          -01
     Tolerance                                   -14           19
     Good Impression                             -15          -16
     Communality                                  07          -24
     Achievement via Conformance                 -14          -10
     Achievement via Independence                -22          -31
     Intellectual Efficiency                     -13           04
     Psychological-Mindedness                    -19          -47
     Flexibility                                 -27          -45
     Femininity                                   02          -06
II. Heston Personal Adjustment Inventory
     Analytical Thinking                         -12          -13
     Sociability                                 -25          -10
     Emotional Stability                         -06           13
     Confidence                                  -23          -19
     Personal Relations                          -15           ..
     Home Satisfaction                            06           02
III. Minnesota Teacher Attitude Inventory        -12           ..
IV. Strong VIB
     Elementary Teacher                           19           17
     Femininity-Masculinity                      -14           01
V. ACE Psychological Examination
     ACE Quantitative                            -08           ..
     ACE Linguistic                              -13           ..
     ACE Total                                   -13          -11
VI. How I Teach                                  -13           ..
VII. Personal-Social Occupational Aptitude        05           06
VIII. Subject Matter Proficiency
     Understanding Social Concepts               -04          -10
     Background: Natural Sciences                -16          -09
     Correctness of Expression                    15           01
     Ability: Quantitative Thinking               ..           14
     Interpretation: Social Studies              -03          -05
     Interpretation: Natural Sciences            -07          -05
     Interpretation: Literary Materials           08          -15


Note. Correlation coefficients of 21 or greater are significant at the .01 level
and hence are italicized. Entries shown as .. are illegible in the source.




The Student-Teacher Sample. As college freshmen or sophomores 
there was an initial group of 464 students who were administered 
the battery of tests. This number comprised students planning to 
teach from the kindergarten to college. One hundred and fifty of 
these were interested in teaching at the elementary school level and 
survived the academic rigors, offers of marriage and the frequently 
resulting abandonment of plans for a career in teaching, temptations 
to transfer elsewhere, financial hardships, or other dissuading factors. 
Practically all of the students were women. In fact there were 130 
women and 20 men in the sample employed. 

Their student-teaching experience, during which the ratings were 
made with the rating scale, in almost all instances encompassed 
two semesters, in two different public schools, in two different 
grades, with two different classroom teachers. There was super- 
vision by two different individuals employed by the University 
specifically for the purpose of supervising student-teachers. A few 
students completed all of their student-teaching in one semester in 
the same school and in the same grade. In all cases the student- 
teaching took place during the student’s final year in college. 

The rating scale previously described was applied at the end of 
each semester of the student-teaching experience. The criteria of 
student-teaching evaluation consisted of a composite of ratings for 
the two semesters. There was a remarkable degree of consistency 
between the two ratings. 

The University of California by virtue of its high and tradi- 
tionally academic entrance requirements attempts to select for its 
students only the top twelve to fourteen percent of high school 
graduates. This selective process would tend to reduce the range of 
talent and thus the possibility of the occurrence of high correlation
coefficients in those traits correlated with a criterion variable. One
could expect the greatest amount of attenuation in validity coeffi-
cients within the two test areas of academic aptitude and academic
achievement, if one may assume that standing in these cognitive
domains is related to success in teaching.

Results. In Table 1 are shown the coefficients of correlation of 
each of the personality tests and the subtests with two of the previ- 
ously cited criteria: the total rating scale scores and the student- 
teaching grades. 


Discussion. In interpretation of the predictive validities, it ap-
pears that in terms of the results of the CPI the successful elemen-
tary school student-teacher is as "normal" in most personality
characteristics as are the groups employed in standardizing the
tests and scales. The successful teacher is not different from mem-
bers of the standardization sample in dominance, social initiative, and
capacity or desire for social status. He tends to display to a signifi-
cant degree an outgoing, sociable, and participative temperament.

There is the indication from the grades in student-teaching that 
the more successful teacher shows a lower degree of self-acceptance 
—a finding which suggests that he tends to be conventional and 
quiet and given neither to self-centeredness nor aggressive behavior. 
In terms of the criterion of student-teaching grades, there is almost
a significant relationship for the predictors of Sense of Well-Being,
Responsibility, and Tolerance.

In the capacity for creating a good impression and the desire to
do so, the successful teacher appears to be slightly less high than is 
the relatively unsuccessful teacher. 

Also in terms of his student-teaching grades there is a significant 
negative correlation with the general modal pattern (or commu- 
nality) of the CPI. 

The successful teacher tends not to achieve or to display achieve- 
ment potential corresponding to those factors of interest or motiva- 
tion which facilitate achievement where either conformance or 
independence are positively valued behaviors. Furthermore, he ex- 
hibits a significant tendency to be less flexible than those members 
of the standardization sample. 

In Intellectual Efficiency, which is not a measure of intelligence at 
all but an indicator of the degree of personal and intellectual effi- 
ciency that the individual has attained as a functioning social being, 
there is no indication that this scale does differentiate the successful 
teacher from the unsuccessful one. 

The Psychological-Mindedness scale determines the degree to
which the individual is interested in and responsive to the needs
and motives and experiences of others. Of all variables studied,
standing on this scale shows the highest negative correlation with
grades in student-teaching.

The femininity of interests of the women student-teachers is not
significantly related to teaching success in either the CPI or the
Strong inventory.




In most of the HPAI scales there is a slight negative correlation
between what the scale measures and success in student-teaching.
In Analytical Thinking and Personal Relations there was a slightly 
negative correlation between these scales and the criteria. This 
indicates a tendency toward uncritical acceptance of others’ ideas 
and a slight tendency toward irritability. Only in Sociability and 
in Confidence are the correlations significant at the .01 level (with 
the rating scale criterion). There is the suggestion of social timidity 
and feelings of inferiority. 

In attitudes regarding teaching (variable 25) and in measured 
personal-social aptitude (variable 32), the instruments employed 
detected no significant relationships with success in teaching. 

In the Strong Vocational Interest Blank, the correlation of scores 
on the Elementary Teacher scale with each criterion variable was 
just below the requirement of significance at the .01 level. 
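
The threshold of about .21 for significance at the .01 level can be approximated with the Fisher z transformation. The sketch below is an illustration, not the author's procedure. It assumes a two-tailed test and that all 150 students contribute to each coefficient, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def critical_r(n: int, alpha: float = 0.01) -> float:
    """Approximate smallest |r| significant at level alpha (two-tailed).

    Uses the Fisher z transformation, whose standard error is
    1 / sqrt(n - 3); this is an approximation to the exact t-based value.
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.tanh(z_crit / math.sqrt(n - 3))


print(round(critical_r(150), 2))  # 0.21
```

With 150 cases the approximation gives roughly .21, so the Elementary Teacher coefficients of .19 and .17 fall just short of that value.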

The correlations with the ACE quantitative, linguistic, and total 
scores were low and slightly negative. Although they might be 
expected to be low because of the restriction in range in the sample, 
one would hope that the poorer teachers do not tend to rank higher 
in academic competence than do the better teachers. 

There is a low correlation between the How I Teach test and 
the criteria. One might infer that an early knowledge (in the col- 
lege lower division years) of good teaching principles and practices 
does not materially contribute to successful student-teaching. 

In the areas of the subject-matter background of prospective ele- 
mentary school teachers it appears that only in the area of Ability 
to do Quantitative Thinking is there a positive relationship to teach- 
ing competence and that with only one criterion—the rating scale. 

In order to ascertain the combined effectiveness of some of the
variables and each of the criterion measures, a multiple correlation
coefficient of the predictors with the four highest correlations with
each criterion was calculated. These multiple coefficients were .37
with the rating scale and .67 with student-teaching grades. It would
seem that a multiple R which uses at least four of the highest cor-
related personality factors would be extremely useful in screening
and guidance for a career in elementary teaching. Of course, before
these variables are used in multiple regression equations for
purposes of selection, cross-validation studies with new samples
would necessarily be required.
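
The need for cross-validation can be made concrete with a common shrinkage adjustment for the multiple correlation. The sketch below applies the familiar adjusted R-squared formula to the two reported coefficients, reading them as .37 and .67, with N = 150 and four predictors. It is an assumption-laden illustration with a hypothetical function name, not a substitute for cross-validation on a new sample.

```python
import math


def shrunken_r(r: float, n: int, k: int) -> float:
    """Multiple R after the usual adjusted R-squared correction.

    r: observed multiple correlation, n: sample size, k: predictors.
    The adjustment ignores the extra optimism introduced by picking
    the k best predictors out of a larger pool.
    """
    adjusted = 1.0 - (1.0 - r * r) * (n - 1) / (n - k - 1)
    return math.sqrt(max(adjusted, 0.0))


for observed in (0.37, 0.67):
    print(observed, round(shrunken_r(observed, n=150, k=4), 2))
# 0.37 0.34
# 0.67 0.66
```

The correction is small here, but the four predictors were chosen as the best of thirty-nine, so the shrinkage on a fresh sample could be considerably larger than this formula suggests.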


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 2, 1963 


BOOK REVIEWS 


Edited by 


WILLIAM B. MICHAEL 
University of California, Santa Barbara 


Gerberich, Greene, and Jorgensen's Measurement and Evaluation in the Modern School. David A. Payne
Blum's A Model of the Mind. Harold Borko
Stein and Heinze's Creativity and the Individual. George I. Brown
... Benjamin Kleinmuntz
Yuzuk's The Assessment of Employee Morale. J. H. Rar...
Crow and Crow's Readings in Guidance: Principles, Practices, ... Edition). Victor B. Cline
Sarason, Davidson, Lighthall, Waite, and Ruebush's Anxiety in Elementary School Children. David B. Stucki, Elinor Wine, and Julian C. Stanley
McGowan and Schmidt's Counseling: Readings in Theory and Practice. Henry Kaczkowski
Bellows' Psychology of Personnel in Business and Industry. William Coleman
Tussing's Study and Succeed. Reginald L. Jones and Robert ...
Waters, Rethlingshafer, and Caldwell's Principles of Comparative Psychology. Patrick J. Capretta
Cowen, Underberg, Verrillo, and Benham's Adjustment to Visual Disability in Adolescence. Reginald L. Jones
Blair, Jones, and Simpson's Educational Psychology (Second Edition). Reginald L. Jones
Geldard's Fundamentals of Psychology. Roy M. Fitch


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
Vol. XXIII, No. 2, 1963


Measurement and Evaluation in the Modern School by J. Raymond 
Gerberich, Harry A. Greene, and A. N. Jorgensen. New York: 
David McKay Co., Inc., 1962. Pp. xviii + 622. $6.95.

This introductory text was an outgrowth of the authors' concern 
with the measurement problems of both the elementary and 
secondary teacher. It is considered “. . . suitable as a first book in 
measurement and evaluation for those students . . . who know very 
little about measurement in education and its possibilities for the 
improvement of classroom instruction." It is felt that the emphases 
throughout, however, are still on the evaluative and diagnostic 
functions of measurement, rather than on the instructional uses. 

The twenty-five chapters are organized into five general parts as 
follows (number of chapters allocated in parentheses): Acquiring 
Background For Pupil Appraisal (3) ; Using Standardized Tests and 
Techniques (5) ; Constructing and Using Classroom Tests and Tech- 
niques (4); Applying Statistical Procedures to Measurement and 
Evaluation Results (2); and Measuring and Evaluating in the 
School Subjects (11). Each chapter begins with a general topical
outline and concludes with helpful suggested activities, topics for
discussion, and selected references. In addition, separate indexes
are furnished for names and subjects. Two appendices are provided,
one "cook-booking" the computation of the Pearson product-moment
correlation coefficient, and the other furnishing a directory of test
publishers.

References and material covered are for the most part current.
Because of the tremendous growth taking place in the understand-
ing of measurement processes, both at technical and consumer
levels, it is suggested that new texts in the area might include
discussions of current and anticipated trends. Consideration of new
and/or old controversial topics, or the re-thinking of basic concepts
such as those underlying construct validity would be illustrative.
Further, reference might be made to the writings of recent critics
of educational and psychological testing. This would serve to caution
the beginning student as to the limitations of evaluation procedures,
and to place in proper perspective the kinds of justified and unjusti-
fied common criticisms made by the frequently unsophisticated
"public defenders."




Although the text is brief, one wonders what functions are served 
by the inclusion of chapters on the measurement of (a) socioeco- 
nomic status and (b) health and physical fitness. Space might better 
have been devoted to an extended discussion of possible interaction 
effects of the socioeconomic variable with the measurement process. 
Since most colleges and universities now offer required formal course 
work in the second area, the inclusion of this chapter is therefore 
helpful in only a limited number of instances. At best the material 
might have been treated in the specific chapter on the measurement 
of outcomes in physical education and recreation. 

One of the primary criteria to be considered in evaluating an 
introductory text is the extent to which instructional goals can be 
met with a minimum of explanation and supplementation by the 
instructor. The present text is more than adequate in this respect. 
Chapters 9 and 10 provide a lucid presentation of basic test and 
item types with numerous examples and helpful suggestions for test 
construction. This material finds excellent expression and applica- 
tion in ten chapters devoted to measurement and evaluation in spe- 
cific subject-matter areas. 

In summary, this excellent text, which obviously synthesizes the 
considerable experience and knowledge of its authors, should find a 
wide audience in introductory measurement and evaluation courses. 

DAVID A. PAYNE
Syracuse University 


A Model of the Mind by Gerald S. Blum. New York: John Wiley &
Sons, Inc., 1961. Pp. xi + 229.

The full title of this work by Gerald Blum states that the model 
of the mind is to be “explored by hypnotically controlled experi- 
ments and examined for its psychodynamic implications.” The 
model concentrates on the processes which intervene between 
stimulus and response, the contention being that it is not enough for 
psychologists to know that given a particular stimulus the organism 
will react with a predicted response. Blum wants to put the “mind” 
back into psychology, and he hopes to accomplish this by studying 
through hypnosis the mechanisms which underlie observed phe-
nomena.

Specifically, the book begins by describing a conceptual model of
the mind. By means of the model, an attempt is made to marry
psychological and psychoanalytical functions such as “cognitive,”
“affective,” and “inhibitive” with concepts borrowed from electrical
engineering and computer technology. These other concepts include
such terms as “signals,” “circuits,” and “feedback loops.” To use
Blum’s words, “In sum, the model, as well as its language is quite
eclectic. But eclecticism runs the risk of not pleasing anyone more
than a little and offending everyone quite a lot, so it is best to
proceed at once with a demonstration of the model's heuristic value in
areas of research” (p. 19). This he does by means of twelve
experiments in which hypnosis and the Blacky pictures are used to
explore mental states of ten college student subjects.

The experiments, described in Part B, touch upon such topics as
[...]tion of associative recall, inhibition of extraneous perception
in the presence of anxiety-laden stimuli, and affective response
intensity in response to similar stimuli differing only in vividness.
The experiments were carefully performed and reported in adequate
detail. Blum is a cautious investigator and well aware of the
limitations of his work; indeed, he urges the reader not to jump to
conclusions, for he says, “These are the beginning, exploratory steps
in a program dedicated to the evaluation of a molecular theory
ultimately capable of integrating within a common framework
phenomena ranging from simple bodily sensations to complicated
psychopathological symptoms. In a sense they [the experiments]
[...]d be seen as demonstrations of the model's susceptibility to
experimentation and its capacity to generate intriguing hypotheses.”

Part C of the book brings us back to a theoretical discussion of
the value of the model. The chapters, which relate the twelve
previously-reported experiments with the model, describe the mental
activities of the subjects in terms of the model. By way of example,
reaction formation is “described in psychoanalytic theory as the
development in the ego of conscious socialized attitudes which are
direct opposites of repressed wishes in the unconscious—in 
model terms, simply the hyperfacilitation of signals from a certain
[...]tive network in place of a primitive one blocked by anxiety
[...]red inhibition” (p. 147). The basic question is whether the
second, or model, formulation is in fact simpler than the
psychoanalytic explanation, and whether it will result in more
productive research. Unfortunately, no definitive answer can be given
to this query at this time. Certainly, many psychoanalytically-oriented
psychologists would maintain that Freud’s formulations have resulted
in research and insights into this very area of mental functioning.
It is equally obvious that for Blum, his co-workers, and others who
prefer to describe their models of human behavior in engineering
terminology, this proposed model of the mind may be more
appropriate. What can be said with certainty is that the reported
experiments lend support to the potential value of the model.

Psychologists interested in conducting research on the intervening
variables between stimulus and response would do well to read this
book, as would others who are interested in using hypnosis as an
experimental technique. These two groups would benefit from the
technical aspects of the work, but the reviewer would also like to
recommend the book to the more general group of researchers and
report writers who would enjoy Blum's vivid and informal style of
writing—a style not often seen in technical writing, but one which 
is most welcome. 

HAROLD BORKO 

System Development Corporation 

Santa Monica, California 


Creativity and the Individual by Morris I. Stein and Shirley J. 
Heinze. Glencoe, Illinois: Free Press, 1960. (Published by the 
Graduate School of Business, The University of Chicago.) Pp. 
428. 

In the introduction the authors note that creativity may be con- 
sidered within these frames of reference: the individual, the environ- 
ment, and the relationship of creativity to the transactions between 
these two. As the title indicates and as the authors admit, the book 
concerns itself primarily with the first of these. From an educational 
standpoint the decision not to focus on environment—including con- 
ditions which might foster or be responsible for the development of 
creativity—though perhaps understandable, is a lamentable one. 
Out of 419 pages, eleven are given over to summaries of the litera- 
ture on stimulating creativity. One can not condemn the authors for 
their decision, however, for with creativity as with many areas of 
education attention seems focused on the description of what is 
rather than on measurement of attempts to change what is to an 
hypothesized what might be. 

More than 300 articles and books are abstracted in the book. 
There is a consideration of heredity, the nervous system, age, early 
experiences, religion, cognitive factors, personality characteristics 
and motivating factors, and psychopathology as major areas in the 
study of creativity. Perhaps of greatest interest to the reader in 
education and psychology are the sections on empirical studies, to 
be found in chapter nine, “Personality Characteristics and Moti- 
vating Factors,” and all the contents of chapter twelve, “Stimulating 
Creativity.” 

An unfortunate inadequacy of the book as a research source for
those in education is the index. A spot check reveals the omission of
the following under the category, Education: page 268, a report of
the relationship between high achievement in graduate work toward
the Ph.D. and later productivity and creativity; pages 270-1, a con-
sideration of student-dominant vs. instructor-dominant methods in
brainstorming including student reaction to the methods; and page
274, a comment on the need for studies of factors involved in talent
loss with the specific mention of school drop-outs. Perhaps even
more inexcusable is the lack of reference under the category Barron-
Welsh Art Scale to pages 262 and 263 where the scale is mentioned
in two reports of studies by Barron.




In light of the current surge of interest in creativity, there is little
question that the book is a contribution. From the point of view of
persons in education and psychology one wonders whether, if the
book had been sponsored by a Graduate School of Education rather
than by a Graduate School of Business, its potential usefulness
would have thereby been increased. Although the content would
be the same, attention to indexing in these areas would probably
have been improved.

On the whole, Creativity and the Individual is a good first source
for those interested in pursuing studies in creativity. Although most
of the major works reported will probably have been encountered
by the researcher already engaged in study in this area, it does
provide a check for background information.

GEORGE I. BROWN

University of California, 
Santa Barbara 


Personality: A Behavioral Science by E. Earl Baughman and
George S. Welsh. Englewood Cliffs, New Jersey: Prentice Hall,
Inc., 1962. Pp. 566.

In a personality textbook market flooded with literature on “how
to repair broken egos,” the appearance of Baughman and Welsh’s
text deserves a resounding “Amen!” Here, not a single chapter is
devoted to “self-analysis” or to “faulty adjustment patterns”; nor
does this book, as do most of the others in the field, parade a host of
personality theories in cafeteria style. The conventional chapter on
psychopathological conditions (how to recognize, cope with, avoid,
and exercise them) is condensed in this text to a two-page list of
behavior disorders; it is neatly tucked away, where it belongs, in a
“Notes and Suggested Readings” section. In short, the authors
had set before themselves the tasks of reversing the current trends
of placing emphasis on adjustment in personality courses and of
indoctrinating students of personality with a particular theoretical
point of view. They have succeeded admirably.
This book was written in an attempt to arouse the interest of
the beginning student of personality research in studies that are
[...]. [...]d to do this by [...]lly weaving historical and theoretical
background and empirical research into their discussion of many
important [...]. The ground they cover is wide. The book is divided
into [...] parts: (a) Personality as a Scientific Study, (b) Personality
[...], (c) Personality Description and Assessment, and [...]
(Chapter 5); it evaluates the impact of family environment [...]s
development (Chapter 6) and acquisition of role behavior
(Chapter 7); and it stresses the importance of extra-familial de-
terminants. The chapters which do not focus on the individual
per se are written on a level that is designed to maintain the atten-
tion of even the most exacting readers. Their discussion of the
scientific method in the study of personality is quite good, and
their handling of such issues as the base-rate problem in prediction
and clinical vs. statistical prediction is excellent.

The major portion of this book was written by Baughman, and 
only two chapters were written by Welsh. Almost certainly Welsh 
must be responsible for the discussion on the prediction of behavior 
(Chapter 15), and probably he contributed heavily to the writing 
of the chapter on Anxiety (Chapter 13). Together, these two au- 
thors are about as unlikely a pair as ever collaborated on a book. 
Baughman is known primarily for his contributions in the Rorschach 
sphere and Welsh’s major work has taken him to the other side of 
the subjective-objective continuum in his preoccupation with MMPI 
folklore and literature. Their common ground, however, aside from 
the University of North Carolina campus they share, seems to be 
that they both are committed to a “hard-headed” empirical ap- 
proach to behavioral phenomena. 

Their style of writing is quite readable, and the format of the 
book is adequate except for a photograph of Hermann Rorschach 
(p. 289) which resembles a card on a projective test and undoubt- 
edly was included to frighten freshmen. Certainly, the teacher who 
would like to introduce his undergraduates to the study of personal- 
ity as a behavioral science will find this text more than sufficient 
to meet his needs. The teacher who would like to use this book on 
a graduate level would be well advised to supplement it heavily 
with other readings. 

BENJAMIN KLEINMUNTZ 
Carnegie Institute of Technology 


The Assessment of Employee Morale by Ronald Paul Yuzuk. Bu- 
reau of Business Research Monograph Number 99. Columbus, 
Ohio: Ohio State University, 1961. Pp. ix + 67. 

This soft-covered monograph, one of the Ohio Studies in Per-
sonnel, is essentially a study in methodology. Previous inventories
of morale have been characterized by measurement of a general,
diffuse attitude of the employee toward his work, rather than by
his attitudes toward specific, definable factors of his job. The
hypothesis of this monograph is that this general factor is a
function of the type of items usually included in inventories rather
than a function of the construct itself.

To determine the effect of the type of items used, the author
constructed two inventories, one evaluative and one descriptive.
The former inventory is similar to existing inventories in format,
while the latter is designed to be more objective and behavioral in
nature. These inventories were administered to the employees of an
electric coil manufacturing company, and the results factor-ana-
lyzed and correlated with selected criteria that were considered
indices of employee morale.

The results were generally as predicted by the author. A general- 
bias factor, which accounts for a major portion of variance in the 
correlation matrix of the Evaluative scale, is not found in the 
Descriptive scale. This general factor represents a “non-interpre-
table, diffuse employee attitude” which is unrelated to any of the
criteria used in the study, but which influences every Evaluative 
factor with the exception of Job Satisfaction. 

The conclusions of the study are that a general attitudinal bias 
pervades and distorts evaluative-type inventories of morale; that 
the proper measurement of employee morale requires the determina- 
tion and suppression of this general bias; and that, if determined 
in a bias-free form, certain dimensions of morale relate to selected 
factors of employee performance. 

It is important when considering these conclusions to note the 
limitations of the study, a number of which are mentioned by the 
author. For one thing, the final scales included only items which 
tended to relate to the criteria, and the scales therefore reflect only 
these specific factors of morale. Whether these scales are actually 
a measure of over-all morale has not been determined. Another 
problem area not fully investigated is the loss in reliability, as 
estimated by a Kuder-Richardson formula, in the Descriptive Form
factors as compared to the Evaluative Form factors. It may well 
be, as the author surmises, that this is due to the decreased homo- 
geneity caused by suppression of the general-bias factor; however, 
there also may be an increase in the error of measurement, which 
would be reflected in test-retest reliabilities. 
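
The dependence of a Kuder-Richardson estimate on item homogeneity can be illustrated with a minimal sketch. The review does not say which Kuder-Richardson formula the monograph used; formula 20 is assumed here, and the two small response matrices are invented for illustration only.

```python
from statistics import pvariance

def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomous (0/1) items.

    responses: one list of item scores per respondent.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    # Proportion of respondents scoring 1 on each item.
    p = [sum(r[i] for r in responses) / n_people for i in range(n_items)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    # Population variance of the total scores.
    var_total = pvariance([sum(r) for r in responses])
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

# Hypothetical data: items that rank respondents consistently.
homogeneous = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0],
               [1, 0, 0, 0], [0, 0, 0, 0]]
# Hypothetical data: items that barely agree with one another.
mixed = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0],
         [0, 0, 1, 1], [1, 1, 1, 0]]

kr_homogeneous = kr20(homogeneous)
kr_mixed = kr20(mixed)
```

Under these assumptions the homogeneous set yields the higher coefficient, which is consistent with the author's surmise that removing a shared general factor can lower an internal-consistency estimate without necessarily increasing error of measurement.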

These limitations, however, only serve to point out the need for
further research in this area. The author has pinpointed a weak-
ness in the techniques for assessment of morale of employees and
has presented a possible remedy. It is hoped that this monograph
will stimulate further investigation and lead to additional improve-
ments in methodology in this important area.

J. H. RAINWATER, JR.
County of Los Angeles
Civil Service Commission


Design for a Study of American Youth by Project TALENT Staff: 
John C. Flanagan, John T. Dailey, Marion F. Shaycoft, Wil-
liam A. Gorham, David B. Orr, and Isadore Goldberg. Boston:
Houghton Mifflin Company, 1962. Pp. 240. Cloth edition, $4.00; 
Paperback edition, $1.95. 




When U. S. Commissioner of Education, Lawrence W. Derthick,
launched what subsequently has become known as Project TAL- 
ENT, he knowingly put into the hands of American psychologists 
and researchers not only a monumental and exciting challenge, but 
also the funds necessary for such a task. To describe herein the 
processes through which Project TALENT evolved and finally 
emerged as the most extensive and intensive longitudinal study 
of youth ever undertaken would be wholly inadequate and inap- 
propriate, inasmuch as it is so admirably done in Design for a Study 
of American Youth. 

The TALENT staff, led by former Air Force Colonel John C. 
Flanagan, has produced this little volume as the first in a series of 
reports describing the genesis of the Project; the format of the 
testing program; a composite of the American secondary school;
the characteristics, goals, and abilities of American youth at the 
beginning of the ’60s; a follow-up of those same youth as they move 
into and through their adult years. Design, therefore, does exactly 
what its name implies. It provides the reader with a remarkably 
clear picture of exactly what Project TALENT set out to do. Like 
all good expository writing, it provides specific answers to the ques-
tions: who? what? when? where? why? and how?

Chapters 1 and 2 describe the “when?” and the “why?”, together 
with the “who?” involved in planning, advising, and developing the 
project. Chapter 3 describes the method and care used in sampling 
ten million youth in 26,000 schools, with a resulting half-million 
subjects in about 1000 selected secondary schools. 

Chapters 4, 5, and 6 provide the reader with the rationale for 
the tests which were developed for Project TALENT, together with 
an adequate sampling of representative test items. Similarly, Chap- 
ters 7, 8, and 9 deal with the personality inventory, the vocational 
interest inventory, and the inventory of background factors known 
as the “Student Information Blank.” 

“How the Tests Were Given” is the title of Chapter 10. Is it not
encouraging to find such a straightforward title? Chapter 11 extols
the contributions made by the Iowa Electronic Test Scoring
[...] Document Reader, and by the electronic com[...]

The book concludes with two quite brief and somewhat inade-
quate [...] which were used to obtain [...] schools an[...]
gui[...] few pages [...], together with a fi[...].

Although parts of the book were written by individual mem-
bers of TALENT staff, it has the unity of [...] mind, of a single-
ness of purpose; that is, to tell its story so that any reader on any
continent may find out exactly what the Design for a Study of
American Youth actually has been. Such was the purpose of this
little book. The Staff of Project TALENT are to be congratulated
for this lucid report.

ROBERT C. AUKERMAN 
University of Rhode Island 


Readings in Guidance: Principles, Practices, Organization, Admin- 
istration by Lester D. Crow and Alice Crow (Editors). New 
York: David McKay Co., Inc. Pp. 626. 

The present book of Readings in Guidance differs from other
recent Readings texts (Peters, McGowan, Patterson) in that it
attempts to cover the whole gamut of guidance services.


“The editors have attempted to select articles and publish ma- 
terial that would present and help clarify the guidance respon- 
sibilities of counselors, teachers, administrators and parents.” 


Such broad purposes account for the large number of topics (15)
and articles (95). Because the emphasis is upon range rather than
depth the book will be most useful in introductory courses.

Among the fifteen, there are chapters relating to Principles of
Guidance, Values and Discipline, Group Guidance, Evaluation in
Guidance, Occupational Information and Vocational Guidance,
Guidance in Elementary School, and Guidance in Junior High
School. The chapters on school practices would also be useful to
people interested in organization and administration of guidance
services. The articles are generally quite recent and represent a
wide assortment of journals and authors.

Although this reviewer finds considerable congruence between
the authors’ purposes and the content of the text, there are several
questions which need to be raised.

The first question relates to the decision to omit bibliographical
references in order to print more articles. To some extent this de-
feats one of the purposes of a book of readings. If the student must
go to the journal to get the bibliography, then the journal article
could just as well have been assigned in the first place.

The second question pertains to Chapter Fifteen and the state-
ment of “Standards for School Counselors.” Although the authors
include a footnote which reminds the reader that this is only a ten-
tative statement and that further material will follow in the way
of position papers, the reader might not always keep such footnotes
in mind. Considering the probable excitement caused by its initial
publication, this statement of standards could easily have been
eliminated and saved for a future edition.

A third question is related to the lack of research articles. The
chapter on Evaluation is almost exclusively concerned with tests
and measurements. It would be helpful if some chapters had been
rounded out and strengthened with pertinent research articles.




Despite the questions raised, the book is a handy reference for 
introductory courses and for those who have an interest in the 
general area of school guidance services. 

GILBERT D. MOORE
State University of New York 
at Buffalo 


Appraising Vocational Fitness (Revised Edition) by Donald E. 
Super and John O. Crites. New York: Harper and Brothers, 1962. 
Pp. xv + 688. $8.95. 

In their newest revised edition of Appraising Vocational Fitness, 
Super and Crites have given us a book which should serve as a
definitive text for any college level course in the area of vocational 
guidance and counseling. The writing and style are crisp, clean, and 
lucid for the most part with much lean and little fat. The book is well 
bound and the printing job is definitely superior. Approximately 4 
per cent of the 688 pages are devoted to a discussion of test construc- 
tion, standardization, and validation. Another 73 per cent is spent 
discussing specific tests listed under such chapter headings as: In-
telligence, Proficiency, Clerical, Manual, Mechanical, Spatial, Es-
thetic and Artistic, Musical, Vocational Interests, Personality,
Standard Batteries with Norms for Specific Occupations (e.g., GATB,
DAT), and Custom Built Batteries for Specific Occupations (e.g.,
lawyer, engineer, etc.). The final chapters of the book are devoted
to use of test results and counseling, to report preparation, to illus- 
trative cases, and to discussion of three methods of vocational ap- 
praisal: (1) clinical, (2) psychometric profile and (3) psychometric 
index. In the appendix they also list, with addresses, 25 leading test 
publishers and scoring services and thoughtfully provide the reader 
with a separate index for author, subject, occupation, and test. 

In a more critical vein, one would hope that in future editions the
topic and paragraph titles or headings would be printed in a darker
and more readily visible type. One quite glaring error, especially in
a revised edition such as this, is the inclusion and discussion of the
Moss Medical Aptitude Test as the standard instrument for the se-
lection of medical students. This test was replaced over a decade ago
by the Medical College Admissions Test.

It was also hard for the reviewer to understand why 8,000 to 9,000 
words were spent on a non-empirically developed test such as the 
Edwards Personal Preference Schedule, which Super and Crites 
themselves cite as having little if any congruent validity, whose role 
in vocational selection and guidance is limited “mainly as a promis-
ing instrument for research,” and whose “needs” are based only on
the face validity of the items which are subject to considerable con- 
fusion in interpretation. Yet such a test as Gough’s empirically- 
developed California Psychological Inventory (published in 1956),
which in the reviewer's opinion has a very promising future in guid-
ance and counseling work with normal adolescents and adults, was
lamely dismissed with “if space had permitted the CPI would also 
have been discussed." The Guilford-Zimmerman Temperament Sur- 
vey was also dismissed in similar fashion. 

When the 255 references listed at the end of the chapter on In- 
telligence (chosen at random) were surveyed, only 7 per cent had 
been published within the last twelve years. This suggests that to 
some extent the book is dated. 

Despite a few such criticisms, Super and Crites have, in overview, 
given us an excellent text which is essentially traditional and con- 
servative in approach. Their reviews of vocational tests and proce- 
dures are frequently outstanding and always judicious, thoughtful, 
and impartial. 

VICTOR B. CLINE
University of Utah 


Anxiety in Elementary School Children by Seymour B. Sarason,
Kenneth S. Davidson, Frederick F. Lighthall, Richard R. Waite,
and Britton K. Ruebush. New York: John Wiley & Sons, Inc.,
1960. Pp. viii + 351. $6.00. 

The authors of this book have undertaken a formidable task— 
to describe the results of six years of research in a logical order and 
at the same time maintain a style readable to school personnel as 
well as behavioral scientists. After wallowing in 83 pages of intro- 
duction, hypotheses, and literature review, the hard-working teacher 
may well decide it is not worth his effort; but read on, teacher, it is. 

The authors have focused their attention on anxiety, specifically
test anxiety, as a possible source of the “relationships and discrep-
ancies between performance and potential” of children in the ele-
mentary school. In the first three chapters of introduction, hy-
potheses, and literature review, terms are defined and the theoretical
foundations of their research are carefully outlined. The review of
the literature, 56 pages, is divided into four sections: literature deal-
ing with the concept of anxiety, differentiation of anxiety and fear,
differentiation of anxiety and phobia, and effects of anxiety on in-
tellectual performance. It is evident that the theoretical orientation
of the authors is primarily Freudian, with consideration of the im-
portant contributions of Sullivan, Jersild, and others.

Chapters 4, 5, and 6 describe the formation of the anxiety scales,
construction of a built-in lie scale, and the initial validity studies.
The TASC (Test Anxiety Scale for Children) consists of thirty
items concerning the worries of children about, and their reactions
to, classroom tests given by their teachers. The GASC (General
Anxiety Scale for Children) has 45 items about other worries and re-
actions. (Both scales are reproduced in their entirety in the book.)
Both sets of questions are stated positively; that is, a “yes” answer
increases the anxiety score. All “yes” answers are on the left and all
“no” answers on the right. Thus, an acquiescence set would tend to
inflate the anxiety score and a position set would tend to produce a
score either spuriously large or spuriously small. The authors’ justifi-
cation of this format appears to be an inadequate substitute for
constructing the scales so as to control the effect of response sets.
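
The reviewers' point about acquiescence can be made concrete with a small sketch. The thirty-item length is taken from the description of the TASC above; the "yea-sayer" respondent and the balanced keying are hypothetical and are not a feature of the published scales.

```python
def scale_score(answers, keyed_yes):
    """Count the answers given in the keyed (anxious) direction."""
    return sum(1 for answer, key in zip(answers, keyed_yes) if answer == key)

N_ITEMS = 30  # TASC length as reported in the review

# Hypothetical child who answers "yes" to every item regardless of content.
yea_sayer = [True] * N_ITEMS

# Every item keyed "yes", as the review describes for the TASC.
all_yes_key = [True] * N_ITEMS
# Hypothetical alternative: every second item reworded so "no" is keyed.
balanced_key = [i % 2 == 0 for i in range(N_ITEMS)]

score_all_yes = scale_score(yea_sayer, all_yes_key)
score_balanced = scale_score(yea_sayer, balanced_key)
```

With all items keyed "yes," pure acquiescence earns the maximum anxiety score; with balanced keying the same response set lands at the midpoint, so it no longer masquerades as high anxiety.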
Construct validity was determined by correlating (1) the TASC 
and teacher’s ratings of pupils on 17 questions of content similar to 
the TASC; (2) the TASC and mean score of pupils on several 
achievement tests; (3) the TASC and IQ; (4) teacher's ratings and 
achievement; (5) teacher's ratings and IQ; (6) TASC and GASC; 
(7) GASC and achievement; (8) GASC and IQ; (9) gains over time 
on Otis Alpha, Otis Beta, and Davis-Eells tests with TASC; and 
(10) the seven tasks of the Primary Mental Abilities tests and the 
TASC. These correlations were then examined with regard to the 
hypotheses stated in Chapter 2. Sarason and his associates conclude 
that these studies support their original hypotheses and indicate an 
“encouraging degree of validity,” particularly for the TASC. 
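
Each of the ten comparisons listed above rests on a product-moment correlation between two sets of scores. A minimal sketch of that computation follows; the five pairs of scores are invented for illustration and are not data from the book.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for five pupils (not taken from the study).
tasc_scores = [2, 5, 9, 14, 20]
iq_scores = [120, 112, 108, 101, 95]

r_tasc_iq = pearson_r(tasc_scores, iq_scores)
```

A negative coefficient of this kind would be in the direction one would expect if test anxiety interferes with performance; whether the authors' hypotheses predicted that direction for every pairing is not stated in the review.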
Chapters 7, 8, and 9 indicate the results of studies in which the 
authors used these two anxiety scales with other instruments. The 
nature of a task was found to determine whether or not anxiety 
would be interfering or facilitating. The authors are at a loss to ex- 
plain why high-anxious subjects do better on tasks which are called 
facilitating; however, if they had given more than passing considera-
tion to the Taylor and Spence concept of anxiety as synonymous
with drive, a better discussion of their results might have been forth-
coming. Mothers were found to be more defensive than fathers in
answering questions concerning the worries, fears, and overt re- 
sponses of their children. Studies of personality characteristics by 
means of anatomical responses to the Rorschach and details of hu- 
man figure drawings indicated that the authors’ body-image hy- 
pothesis has some merit and deserves further study. Various studies 
indicated sex differences, which the authors discussed in detail. 
In Chapter 10, “Implications for Education,” the authors point
out that the major objectives of their research were to become able
“(1) to pick out those children whose school performance suffers be-
cause of disabling personality factors, (2) to pick out those children
whose early school performance appears adequate but whose later
school performance and behavior will reflect personality disturbance,
and (3) to determine the ways in which the classroom situation can
be used to help those children who have disabling reactions and
attitudes toward school” (p. 263). They conclude that these objec-
tives cannot yet be realized because (a) defensiveness in answering
questionnaires has not been studied adequately; and (b) the validity
of the scales for girls has not been well established. The reviewers
wonder whether the objectives could be realized even if these two
factors were remedied. It would seem that methodological and per-
sonality factors other than anxiety would have to be investigated,
especially to satisfy the third objective of their research.

The book concludes with eight appendices; the first is a study of 
the effects of sequential administration of the anxiety scales, and 
the remaining seven reproduce some of the other instruments used 
in their research. 

Viewed in its entirety, the book presents significant work in a 
field certainly no longer barren, but still sparsely populated. The 
discussions of results are quite lengthy and seem at times apologiae 
for failure to control some probably pertinent factor; however, the 
book should be read by anyone doing research in this area. This 
scholarly volume should certainly give impetus to further work on 
anxiety in elementary school children. 

Глу В. Sruckr, ELINOR WINE, AND JULIAN С. STANLEY 
Department of Educational Psychology 
University of Wisconsin 


Counseling: Readings in Theory and Practice by J. McGowan and 
L. Schmidt. New York: Holt, Rinehart, and Winston, Inc., 1962. 
Pp. xiv + 623. 

The purpose of this book is to emphasize the counseling phase 
rather than the academic or vocational guidance aspect of counsel- 
ing. Approximately 70 per cent of the articles are concerned with 
philosophical foundations or theoretical bases. However, the remain- 
ing articles report research findings which are not necessarily re- 
stricted to counseling techniques. For example, Chapter 7, Com- 
parative Techniques and Counseling “Style,” contains nine articles, 
seven of which explore the role or style of the counselor rather than a 
specific technique. 

The procedure for selecting articles is a basic problem for all edi- 
tors of books of readings. McGowan and Schmidt used the following 
two “convictions” to select the articles: (a) “the relationship which 
develops between the counselor and client is the most significant 
aspect of counseling”; (b) “each counselor will develop both an 
attitude toward counseling and a technique that is consistent with 
what he is as a person or what he hopes to become.” They further 
state that “counselors are eclectic in the sense that they accept and 
adopt attitudes and techniques from different theories and individ- 
uals which are complementary, comfortable, and hence perceived as 
‘sensible’ to them as persons.” It can be seen from the above state- 
ments that the articles reflect a generalized approach to counseling 
rather than a specific school or individual orientation. 

The readings consist of 66 articles and one mimeographed report. 
The articles are drawn from 12 different journals. The four main 
sources are: Journal of Counseling Psychology (24); Personnel and 
Guidance Journal (19); Journal of Consulting Psychology (8); 
American Psychologist (4). The book is divided into three parts, 
each of which is preceded by an introduction that delineates the 
major problems and is concluded by a set of auxiliary reading sug- 
gestions. This arrangement adds strength to the book, for it assists 
the reader in forming the interrelations between various articles. 

Part One, “Origin and Development of Counseling,” consists of 
4 chapters and 23 articles. With one exception, the articles are con- 
cerned with philosophy, theory, or counselor requirements. Part 
Two, "The Counseling Process,” consists of 8 chapters and 38 articles, 
half of which report research findings. The articles discuss expecta- 
tions, personality, style, diagnosis, duration, communications, tests, 
and counseling results. The diversity of opinions and views reflected 
in the articles should give the reader an idea of the complexity of 
the rationale behind the counseling process. Part Three, "Profes- 
sional Issues,” consists of 3 chapters and 5 articles. The chapter on 
Ethics and Legal Considerations is well executed, especially the in- 
troductory remarks. 

The general conclusion that one can reach is that McGowan and 
Schmidt reached their goal of producing a book of readings that 
focuses on the total counseling process rather than on such adjuncts 
as personal or vocational counseling. This goal was reached because 
the editors were not necessarily “eclectically” oriented but used as 
their criteria the pertinent issues and factors inherent in the counsel- 
ing process. The only demur one can raise is that Patterson’s com- 
ment on the Goodstein and Grigg article should have been included. 

Henry KACZKOWSKI 
University of Illinois 


Psychology of Personnel in Business and Industry by Roger Bellows. 
Englewood Cliffs, New Jersey: Prentice-Hall, Inc., 1961. Pp. Vi 
+ 474. $10.00. 

In this revision of a standard, well-known text in industrial per- 
sonnel psychology, new topics introduced include group dynamics, 
concepts of employee motivation, and applications of social psy- 
chology to industry. Bellows continues to emphasize the importance 
" за morale, and employee attitudes. However, the 
bulk of the book is devoted to such traditional tools as interviewing 
vita and other selection techniques, job analysis and evaluation, 
in eiye Numerous examples are given of the use of these tools. 
Stress is placed on research by citing numerous studies in the var- 
ious areas covered by the book. Unfortunately, the reader is told about 
the studies, but not much attention is given to research methodology. 
With its emphasis on quoting research studies, the book appears 
rather pedantic. This is reinforced by the discussion and multiple- 
choice questions at the end of each chapter. The discussion questions 
require simple recall, and the choice questions tend to be the recogni- 
tion type. Emphasis is lacking on evaluation and critical judgment. 

Though some readers may enjoy reading historical research studies 
on phrenology, physiognomy, graphology, and other obsolescent ap- 
proaches, the reviewer wonders whether this is efficient use of space. 
Collections of such studies would seem to be more appropriately in- 
cluded in a book of historical readings. Undergraduate students as 
a captive group may accept the material as informative, but indus- 
trial personnel workers are apt to question the applicability of the 
research findings to their particular situation. 

Although Bellows documents much of what he says, the reviewer 
is curious to know the source for the statement that “... it has been 
estimated that half of the jobs in the United States require no educa- 
tion for successful fulfillment of duties” (p. 190) or “... less than a 
fifth specify high school graduation as a prerequisite for employ- 
ment” (p. 191). 

Two exceptionally helpful appendices are the brief descriptions of 
22 outstanding human management research programs and an an- 
notated list of sources and services for obtaining information. 

Despite some of the shortcomings cited above, this book continues 
to be one of the best texts available for an undergraduate course in 
personnel psychology. Its emphasis on objective research is vitally 
needed to counteract the subjective approach that is so pervasive in 
industry. 

WILLIAM COLEMAN 
Trans Pacific Management Consultants 


Study and Succeed by Lyle Tussing. New York: John Wiley & Sons, 
Inc., 1962. Pp. ix + 157. $2.95. 

Study and Succeed, which might well have been titled “College 
Orientation, Etiquette, and Study Methods,” is a well-written, clev- 
erly-illustrated, generally sound presentation of study skills sprinkled 
with advice to the student on ways of adjusting to the college 
environment. The book comprises 12 chapters: The College Environ- 
ment, Learning and Study, Methods of Improving Study, Compre- 
hension of Words, Reading Improvement, Finding Material, Organiz- 
ing Materials and Taking Notes, Presenting Written and Oral 
Material, Taking Exams, Conditions and Techniques Related to 
Thinking and Learning, Studying Specifics, and Educational and 
Vocational Goals. A number of self-tests, check lists, and exercises 
throughout the book actively involve the student in its content. 

Viewed in perspective of similar books in this area, Tussing's 
work appears less substantial than such books as Robinson's Effec- 
tive Study but more substantial than the pamphlet material on study 
methods currently available. The book would probably be unsatis- 
factory for the full-semester how-to-study course and unappetizing 
to students seeking short-term help. However, the book might be 
used effectively in several ways: as a text in a short course on study 
skills; by the student highly motivated to work on his own; by stu- 
dents under tutorial-type supervision. 

Users of the book will probably want to be apprised of the fol- 
lowing points which, while not marring the soundness of the book, 
do detract from the over-all presentation: The student will probably 
be confused by conflicting statements regarding written reports. For 
example, in listing suggestions “made by students for improvement 
of schoolwork,” the author writes (p. 36): 


“It is a waste of time to wonder how many pages a report should 
contain to impress the instructor. Instead the student should ask 
himself how much he wants to add to his knowledge. The answer 
should determine the length of the assignment for him.” 


Later in the book (p. 92), however, the opposite point is made: 


“... Before beginning any writing project, the student must know 
what the teacher expects as far as length and form are con- 
cerned. Is the paper to be limited to a certain number of words?” 


Another difficulty pertains to the fact that references on studying 
and remediation in specific subject areas are not cited by the author. 
Such references are particularly important since Tussing’s treatment 
of these areas is aborted. Even in areas treated more substantially 
(e.g., points of grammar, oral reports), references to sources giving 
additional information would seem appropriate. 

A final point concerns the self-test exercises. While many of these 
exercises are excellent, they have not been used consistently to ex- 
pose the student to well-constructed measuring devices which also 
ee relevant aspects of the subject matter. Witness these items: 

А. talk given before a group of people is only an l 
"Newspaper articles are EU eU indexes." 
(True-false) “Topic and sentence outlines are different in every 
way but one." 


eer QNT tests require the same amount of review 


In conclusion, the over-all presentation is sound and 
there is little doubt that student E us could be im- 
proved by application of the principles outlined by Tussing. 

REGINALD L. JONES AND ROBERT R. BROWN 
Miami University 


Psychology in Teaching (Second Edition) by Henry P. Smith. En- 
glewood Cliffs, New Jersey: Prentice-Hall, Inc., 1962. 

Smith’s Psychology in Teaching is a textbook designed to give the 
teacher and prospective teacher direct help in meeting actual class- 
room problems by providing a background on the nature and needs 
of the child. In both intent and content this book resembles many 
other educational psychology textbooks. 

The book comprises 16 chapters encompassing four broad sec- 
tions: The Professional Teacher’s Skills and Characteristics, Facts 
and Trends of Growth and Development from Infancy to Maturity, 
How and Why People Learn, and Motives and Problems in the Life 
of the Individual. Each chapter has a section, Problems and Pro- 
jects, and a two-level annotated bibliography: (1) “Suggested Read- 
ings” (usually appropriate articles from one or more of the books of 
readings in educational psychology) and (2) “Additional Resources” 
(often chapters in yearbooks or textbooks and articles). 

The emphasis throughout the book is on “principles,” and in pre- 
senting the subject matter the author eschews the need for (1) 
treatment of psychological terms, (2) “review of the historical 
development of various phases of psychological knowledge,” (3) 
mention of important contributors, and (4) concern for theories 
raising “important questions which might confuse rather than en- 
lighten the student.” As a result of these omissions the text will 
appear to some readers to be lacking in scholarship and to be plati- 
tudinous in nature; to others the omissions will undoubtedly augur 
well. 

In his general exposition of the subject matter the author presents 
the results of a number of experiments. Noteworthy in such cases 
is the detail with which these studies are presented and the clarity 
of inferences drawn from them. Disturbing, however, are the broad 
undocumented statements appearing throughout the text. 

For example, many psychologists will disagree with the author's 
unqualified generalization that common principles of learning are 
identified and summarized by Thorndike's (source not specified) 
four laws of learning (the law of effect, the law of exercise, the law 
of readiness, and the law of belonging)—laws which Thorndike him- 
self modified in the thirties (Hilgard, 1956, p. 25 ff). Other profes- 
sional readers will question the author's use of such terms as “nerv- 
ous breakdown,” “love,” “genius,” (all undefined by the author and 
all likely to carry surplus meanings), and “feebleminded.” 

The following cursory and equivocal statements about item analy- 
sis, presented in the context of discussing teacher-made objective 
tests, epitomize for this reviewer much of the flavor of the book: 


“. . . Make every effort to see that the questions or statements are 
clear and unambiguous. . . . you can check for ambiguity by ar- 
ranging the papers in order from high to low score, and then 
choosing five or six of the highest-ranking papers and the same 
number of lowest-ranking papers. If you find that certain ques- 
tions were missed more often by the better students than by the 
poorer students, the chances are that the superior students mis- 
interpreted those questions. Any questions on which both strong 
and weak students show an equal tendency to make errors is of 
no help to you in appraising the relative achievement of stu- 
dents. . . .” 
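
The check described in the passage quoted above can be made concrete with a short sketch. What follows is an illustration only, not a procedure given by Smith or by this reviewer: the function name `classify_items`, the 0/1 scoring of each answer, and the three labels are assumptions introduced here, and the group size `k` stands in for the "five or six" papers mentioned in the quotation.

```python
def classify_items(papers, k=5):
    """Label each item from per-student lists of 0/1 results (1 = correct).

    Hypothetical helper: compares how often the k highest-scoring and the
    k lowest-scoring papers missed each item.
    """
    if len(papers) < 2 * k:
        raise ValueError("need at least 2*k papers so the groups do not overlap")
    ranked = sorted(papers, key=sum, reverse=True)  # highest total score first
    top, bottom = ranked[:k], ranked[-k:]
    labels = {}
    for item in range(len(papers[0])):
        missed_top = sum(1 - paper[item] for paper in top)
        missed_bottom = sum(1 - paper[item] for paper in bottom)
        if missed_top > missed_bottom:
            labels[item] = "possibly ambiguous"    # better students missed it more
        elif missed_top == missed_bottom:
            labels[item] = "no help in ranking"    # equal tendency to err
        else:
            labels[item] = "discriminating"        # poorer students missed it more
    return labels


# Made-up results for four papers on a five-item test.
papers = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1],
]
print(classify_items(papers, k=2))
```

With so few papers in each group the counts are coarse, so a label of this kind is better read as a prompt to reread the question than as a verdict on it.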


The book has numerous creditable points—some of the sections on 
learning are well done and the over-all point of view is good. The 
author’s frequent focusing on reasons for teacher knowledge of cer- 
tain psychological areas is one of the book’s unique features. 

Undoubtedly, many will find this book useful as a text. It will 
certainly be welcomed by instructors interested primarily in expos- 
ing their students to a series of psychologically-oriented generaliza- 
tions. Instructors preferring a more rigorous presentation—reference 
to basic sources, ideas documented and in context—will probably find 
this book unsatisfactory. 


REFERENCE 
Hilgard, E. R. Theories of Learning. New York: Appleton-Century- 
Crofts, Inc., 1956, p. 25 ff. 
REGINALD L. JONES 
Miami University 


The Sociology of Crime and Delinquency by Marvin E. Wolfgang, 
Leonard Savitz, and Norman Johnston. New York: John Wiley 
& Sons, Inc., 1962. 

The pronounced trend in sociological literature toward the ap- 
pearance of collections of readings is given further (and essentially 
effective) impulse with the publication of this book. Between its 
covers are found many of the classic contemporary sociological 
ene materials in both the theoretical and empirical study of de- 
i жой It would therefore be of great value to the library of 
ded Уат as well as to the individual research worker or stu- 
dent who wishes to become aware of the field as a whole with a min- 
imum of time-consuming legwork. 

The editors' contention (in the Preface) that the collection could 
serve as a text in crime and delinquency is perhaps more open 
to question. A great difficulty with this collection (although perhaps 
one of its virtues as well) resides in the wide diversity of viewpoints 
and subjects represented. For the mature reader this undoubtedly 
presents a challenge. For the student (particularly with little back- 
ground in the area) this challenge may be replaced by confusion 
because of an unfortunate tendency to treat many topics with a 
once-over approach. 

This is especially true in Section III, Methods and Techniques of 
Analysis. For instance, Cohn’s article on multiple factor approaches 
is essentially a negative criticism of the approach as such. This is 
certainly Cohn's privilege as a writer. It definitely represents a valid 
viewpoint. However, had the editors intended the collection as a 
text, they would have been wise to include a more neutral and ex- 
tended exposition of the approach to offset Cohn's criticisms. 

The problem here is perhaps one of space. Any one short selection 
is usually not long enough to provide an adequate development of a 
given research method. This, however, a text would be expected to 
do. Thus, Lander’s article, while interesting and informative, is cer- 
tainly not a definitive statement of the application of zero-order 
linear correlation methods to the study of delinquency data on an 
ecological basis. 

Editorial limitations appear to be responsible for one other minor 
defect in this otherwise excellent collection. In some instances more 
information about a study as a whole would have been helpful. There 
are selections which give little hint as to the characteristics of the 
larger study from which they were taken. Both the Glueck and 
Glueck selection on matching delinquents and non-delinquents and 
the Power and Witmer article on the formation of treatment and 
control groups would have benefited from more information as to the 
чаш study design and context which necessitated these procedures. 

Roworo Torco 
Rip Van Winkle Foundation 
Hudson, New York 


Principles of Comparative Psychology by Rolland H. Waters, D. A. 
Rethlingshafer, and Willard E. Caldwell (Editors). New York: 
McGraw-Hill Book Company, Inc., 1960. Pp. ix + 453. 

Here is an ambitious and, the reviewer believes, a successful at- 
tempt to encompass the many-faceted discipline of comparative psy- 
chology. Seventeen well-qualified specialists in animal behavior offer 
a diversified approach including, along with the usual topics, such 
stable fare as discussions on behavior classification, theoretical foun- 
dations of comparative psychology, and research trends in compara- 
tive psychology. The fact that the volume was edited does not in- 
terfere with maintenance of subject matter continuity. This was 
possible, suggest the editors, because of the commonality of approach 
—a transcending Zeitgeist providing the guidelines within which the 
individual authors wrote. 

The efforts of certain contributors are especially noteworthy. For 
example, the chapters on complex processes and comparative social 
psychology by A. J. Riopelle and J. P. Scott, respectively, were par- 
ticularly well done. Of special note here is the splendid practice of 
including more in the summary of a chapter than the usual rehash 
of the text. It is apparent that a deliberate attempt was made to 
integrate and offer thoughtful speculation beyond the material given 
earlier in the chapter. J. L. Fuller also made good use of this method 
of summarizing in his chapter on genetics and individual differences. 

The small amount of redundancy in subject matter coverage found 
in the present volume is perhaps an inherent weakness of edited 
books. No doubt some of this is due to the broad implications of cer- 
tain research, e.g., Harlow’s learning sets, Miller’s conflict theory, 
and Olds’ theory of brain stimulation—to mention only a few. But 
inadvertent superfluity resulting from overlapping coverage by two 
or more authors (e.g., as occurred occasionally in chapters on learn- 
ing and complex processes) is an inadequate substitute for broader 
representation of comparative psychology. 

It is difficult to say to what extent the field has been covered ade- 
quately in this volume. Certainly there are many gaps in the com- 
pilation of existing research, although this is alleviated somewhat 
by the rather extensive bibliography of approximately 800 refer- 
ences. The thoughts and research of some of the foreign investiga- 
tors were not slighted. In an informative and well-written chapter 
on sensory processes by E. H. Hess, approximately half of the refer- 
ences are of foreign research. Studies from Canadian, English, Ger- 
man, Russian, and Scandinavian sources were cited in several of 
the chapters. Considerable importance, and rightly so, was given the 
work of the European ethologists. Their more naturalistic orientation 
to animal behavior is seen as complementing the laboratory-bound 
work of many American comparative psychologists. A rapproche- 
ment between the European and American approaches is welcomed. 

As with most good books, this one caught the reviewer's imagina- 
tion in several places. A list of some of these more noteworthy ideas 
is offered in conclusion. (1) The schematic presentation of the three 
dimensions of heredity, environment, and temporal sequence pro- 
vided by S. Ross and V. H. Denenberg is a helpful guide to the stu- 
dent attempting to gain a more sophisticated understanding of the 
nature-nurture problem. (2) The notion of “contextual learning” 
advanced by A. H. Riesen emphasizes the importance of greater en- 
vironmental control of behavior in higher organisms than in those 
lower in the phylogenetic scale. Dependence upon the contextual 
environment is seen as an important factor in successful discrimina- 
tion learning. (3) The classification scheme offered by J. P. Scott for 
recording and describing social psychological behavior helps consid- 
erably in ordering the kinds of observations made in naturalistic 
settings. Incidentally, in Scott's cursory discussion of territoriality no 
mention is made of the rather dramatic evidence (as reported by 
Helmut K. Buechner, a zoologist at Washington State University) 
of the territorial behavior of the kob, a variety of African antelope. 
The male kob establishes a clearly demarcated area (especially dur- 
ing rutting season) and protects it steadfastly against the encroach- 
ments of other male kobs. 
PATRICK J. CAPRETTA 
Miami University 


Adjustment to Visual Disability in Adolescence by Emory L. 
Cowen, Rita P. Underberg, Ronald T. Verillo, and Frank G. 
Benham. New York: American Foundation for the Blind, 1961. 
Pp. xiii + 239. Cloth edition, $4.50; Paperback edition, $2.50. 

This book summarizes a three-year research program inquiring 
into factors relating to adjustment in visually handicapped ado- 
lescents. The following basic questions were asked: (1) How does 
the adjustment of visually handicapped adolescents (experimentals) 
compare with the adjustment of their matched non-visually handi- 
capped counterparts (controls)? (2) How do the parents of the 
experimental and control subjects compare on measures of attitudes 
toward (a) child rearing, (b) blindness, (c) minorities; or on (d) 
understanding of the child? (3) What relationships exist between 
measures of parent attitudes and understanding and adolescent 
adjustment? In addition, a small number of serendipitous points 
were treated (i.e., comparative adjustment of the visually disabled 
attending residential schools versus those living at home, relation- 
ships between level of adjustment and degree of visual handicap, 
and related concerns). 

Principal among the major findings were: (1) no significant 
differences in adjustment among visually handicapped living at 
home, those attending residential schools for the blind, and the 
controls—a finding running counter to previous results in this 
area; (2) no significant relationships between maternal attitudes 
and child adjustment or between maternal attitudes and maternal 
understanding, although there were strong relationships between 
maternal understanding and child adjustment for both experimental 
and control subjects; (3) some indication that better adjustment is 
associated with greater degree of disability for those subjects living 
at home. 

In a real sense this book represents a fresh and methodologically 
sophisticated approach to old problems in this area. As the authors 
indicate in their penetrating review of the pertinent literature, 
many of the basic problems have been addressed previously but 
answers have been often equivocal. The major methodological short- 
comings of past research have been deficiencies in instrumentation, 
improper application of psychometric instruments, and inadequacies 
in experimental designs. The authors, in commenting on problems 
of instrumentation, note that many of the tests used in comparisons 
of the adjustment of visually handicapped with that of sighted 
individuals are inapplicable when used with the handicapped sub- 
jects. It is understandable, therefore, that considerable time was 
spent in the development of instruments (Situations Projective Test 
(A & B), Attitudes Toward Blindness, Attitudes Toward Child 
Rearing) designed specifically for use with visually handicapped 
persons. Unfortunately, these test-construction efforts were not 
always successful. This is acknowledged by the authors of the 
present volume, and they proceeded to appraise critically their test 
construction and experimental efforts. Thus, the data and findings 
are discussed in the light of such problems as response sets, low 
reliability and validity of instruments, failure to take account of 
sex differences, and matching difficulties. 

In the same critical vein this reviewer questions the I.Q. match- 
ing procedures and the nature of experimental designs used for some 
of the analyses. The problem of experimental design is one of strat- 
egy. Many researchers prefer procedures using carefully matched 
groups incorporating smaller n’s rather than less carefully matched 
groups having larger n’s. It is quite probable, and in keeping with 
the reviewer's own experience, that the authors had to balance 
strategies involving use of as much of the data as possible against 
strategies involving use of less data in a more refined manner. 
Although higher critical values would have been required for re- 
jection of the null hypothesis in the latter case, interpretation of 
findings would have been less difficult. 

Viewed in perspective, the present research represents a signifi- 
cant improvement over previous studies of adjustment and visual 
handicap and consequently provides a good base for further work. 

REGINALD L. JONES 
Miami University (Ohio) 


Educational Psychology (Second Edition) by Glenn M. Blair, 
R. Stewart Jones, and Ray H. Simpson. New York: The Mac- 
millan Company, 1962. Pp. xxiv + 678. $7.00. 

This book is a revision of an earlier edition by the same authors. 
As a textbook the coverage is good and the presentation smooth 
and orderly. References are frequent and up-to-date (although oc- 
casionally pedantic), and illustrations are numerous. The format 
of the book is attractive. Frequent examples of the application of 
psychological principles to actual school situations is an outstanding 
feature. 

The book's twenty-two chapters are divided into six units: I. 
Introduction, II. Growth and Development, III. Learning, IV. Ad- 
justment and Mental Hygiene, V. Measurement and Evaluation, 
and VI. The Psychology of the Teacher. Units II-V comprise the 
areas usually covered in educational psychology textbooks, although 
the section on Evaluation is more extensive than most. The unit 
on The Psychology of the Teacher is increasingly becoming a larger 
part of such texts. Sections titled “References for Further Study” 
and “Questions, Exercises and Activities” follow each chapter. 
The text leaves considerable room for enlargement by the individ- 
ual college teacher; this is as it should be since it is not possible 
for a book of this sort to have depth of coverage. While coverage 
is good and generally accurate, the presentation is not flawless; 
occasionally one finds questionable and misleading advice. An exam- 
ple of questionable advice is the following (p. 434): “The teacher 
may sometimes find out what is bothering the child by having him 
tell a story or write a theme on such topics as ‘What I Dreamed 
Last Night,’ ‘If I had Three Wishes,’ and ‘When I Was Most 
Afraid.’” There was also the suggestion that use of puppets and 
psychodrama may be appropriate (p. 450). It seems to this reviewer 
that, while the above cited techniques may be discussed as examples 
of techniques useful in the treatment of certain childhood difficul- 
ties by trained professionals, attempts at use of these techniques by 
most teachers would be unwise. Occasionally there is misleading 
information, e.g., the statement that the Columbia Mental Maturity 
Scale is an example of a well-standardized test (p. 492). Another 
example of misleading information is the following statement about 
the Wechsler-Bellevue test (p. 491): “There is, however, an excel- 
lent instrument, The Wechsler Bellevue Intelligence Scale, which 
can be used for ages from ten to seventy.” 
As is the case with all texts of this type, the presentation is eclec- 
tic—the authors borrowing frequently from past and contemporary 
theory and research any view or finding having relevance to edu- 
cational practices. Since the pattern of psychological research on 
educational problems is checkered, extrapolations must perforce 
be many. On the whole the text’s authors have been judicious in 
their extrapolations and have covered well, in an up-to-date fashion, 
the major research findings and psychological principles likely to 
be useful to teachers and prospective teachers. 

REGINALD L. JONES 
Miami University (Ohio) 


Fundamentals of Psychology by Frank A. Geldard. New York: 
John Wiley & Sons, Inc., 1962. Pp. 437. $7.50. 

The organization of material in this particular book appears to 
be the most significant contribution. The chapters are shorter, more 
clearly defined, and more numerous than found in most general 
psychology texts. 

Chapter 17 deals with the measurement of abilities, and for the 
most part is given over to a general overview of types of tests. 
The only formula used was that pertaining to establishing the 
intelligence quotient. It would seem that an introductory course in 
psychology should give some space to identification of curves, 
unimodal and polymodal, standard deviations, and variability. 
While validity and reliability are discussed briefly, there is no 
mention of the coefficient of correlation, factor analysis, sampling 
error, levels of confidence, and the inferences to be made from 
statistical findings. This reviewer feels that even beginning students 
should be introduced to elementary statistical concepts in order to 
better appreciate outside reading. 
The book is exceptionally good in the area of motivation and 
presents current research in the writings that are most meaningful 
and up-to-date. The book is scholarly, but somewhat austere. It is 
not given to a great quantity of diagrams, pictures, and tables. This 
last remark is not in terms of a criticism. The content has been well 
selected; however, sometimes humor and aesthetics can also help in 
the association of ideas to be learned by the beginning student. 
This should prove to be a usable textbook in beginning psychology. 
ROY M. FITCH 

San Fernando Valley State College 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
Vol. XXIII, No. 3, 1963


MEASUREMENT OF SEMANTIC HABITS 


JUM C. NUNNALLY, RONALD L. FLAUGHER,
AND WILLIAM F. HODGES 


Vanderbilt University 


IN conjunction with colleagues and students, the senior author has
spent much time during the last several years opening up an
essentially new area of research which is referred to as the study
of semantic habits. Nunnally and Flaugher report the theory and
methodology of the research (1963b) and some empirical correlates
of semantic habits (1963a). The purpose of this article is to describe
in some detail the construction of instruments to measure semantic
habits and the psychometric properties of the instruments.
Our research concerns individual differences in the ways in which
people form semantic relations. By “semantic relations” we mean
use of words to denote, define, describe, and depict personal
reactions to “things” in the human and material environment. For
example, in each of the following phrases, the italicized word is set
in semantic relation with an object: (a) “What a hot day!” (b) “She
is a coed.” (c) “Green snakes are not poisonous.” and (d) “Hand me
the sharp knife.” Many utterances do not explicitly contain semantic
relations: (a) “I am going to lunch.” (b) “What time is it?” and (c)
“Let's quit for now.”
Our hunch was that there are important individual differences
in the ways in which people typically form semantic relations, which
we call semantic habits. For example, we thought that there might
be individual differences in the tendency to form semantic relations
with pleasant words such as good, pretty, and sweet. A number of
other modes of semantic habits occurred to us. In our research we
have (a) defined a number of modes of semantic response, (b)
developed methods for measuring individual differences in the
tendency to use those modes of response, and (c) explored the
empirical correlates of the measures.

Our basic assumption is that semantic habits represent different 
modes of frequency of word usage. If that is so, there are many pos- 
sible links between semantic habits and such cognitive variables as 
verbal learning and perception. It is well known that frequency of 
usage of words relates to both verbal learning and perception, more 
frequently occurring words in the language being more easily learned 
and more easily perceived. Also, it may be that semantic habits 
relate to personality variables. It is commonsensical to think that
differences in the usage of words might relate to differences in 
“needs,” values, and characteristic modes of social interaction. Our 
research results (Nunnally & Flaugher, 1963a) indicate that se- 
mantic habits have small, but consistent correlations with inventory 
measures of personality. 


Measurement of Semantic Habits 


Our major instrument employs binary-choice association items. 
The responses are with respect to four modes of semantic response. 
The first is the tendency to employ words relating to positive evalua- 
tions such as pretty, sweet, and good, which we call the E-plus 
tendency. The second is the tendency to give negative evaluations 
such as ugly, sour, and bad, which we call the E-minus tendency. 
The third is the tendency to respond in terms of observable, or other- 
wise sensuous denotative attributes such as long, sharp, and green 
which we call the D tendency. The fourth is the tendency to
categorize objects, which we call the C tendency. That is, rather than
respond directly to an object with an evaluation or with some
denotative attribute, the object can be placed in a class, or category,
such as reptile, coed, Republican, etc.

The following item contrasts E-plus and D responses:

     Orange: sweet     round

The item

     Orange: sweet     fruit

contrasts E-plus and C responses. The item

     Orange: round     fruit

contrasts D and C responses.


In no case do we contrast E-plus and E-minus responses in the
same items. Rather, different stimulus words are used in contrasting
either E-plus or E-minus with D and C. The item

     Snake: dangerous     long

contrasts E-minus and D. The item

     Snake: dangerous     reptile

contrasts E-minus and C. The item

     Snake: long     reptile

contrasts D and C.

The total instrument contains five subscales: E-plus versus D, 
E-plus versus C, E-minus versus D, E-minus versus C, and C versus 
D. These are combined into three, experimentally independent 
scales. E-plus versus D and E-plus versus C are combined to form 
one over-all E-plus scale. E-minus versus D and E-minus versus C 
are combined to form one over-all E-minus scale. A separate set of 
items contrasts C and D responses, which we refer to as the C-D 
balance scale. A high score on the C-D scale means that the per- 
son gives many more C responses than D responses. 

All of the 143 items are presented in the Appendix. When the in- 
strument is put to use, items are randomly ordered from the five 
subscales. Presented in this way, subjects apparently are unable to 
verbalize what is being studied. 
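
The way the five subscales combine into three scales can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (a subscale code paired with the mode of the chosen word) is hypothetical, and scoring the C-D balance scale as a simple count of C choices is an assumption, since the article says only that a high score means many more C than D responses.

```python
from collections import Counter

# Keyed response mode for each subscale. The subscale codes follow the
# article; treating "C" as the keyed response on the C-D subscale is an
# assumption made for this sketch.
SUBSCALE_KEYS = {
    "E+/C": "E+",
    "E+/D": "E+",
    "E-/C": "E-",
    "E-/D": "E-",
    "C/D": "C",
}


def score_scales(responses):
    """Score one subject.

    responses: iterable of (subscale, chosen_mode) pairs, one per item
    (a hypothetical record layout, not the authors' scoring sheet).
    """
    keyed = Counter()
    for subscale, chosen_mode in responses:
        if chosen_mode == SUBSCALE_KEYS[subscale]:
            keyed[subscale] += 1
    return {
        "total_E_plus": keyed["E+/C"] + keyed["E+/D"],
        "total_E_minus": keyed["E-/C"] + keyed["E-/D"],
        "C_D_balance": keyed["C/D"],
    }


# Five illustrative answers built from the article's Orange and Snake items.
example = [
    ("E+/D", "E+"),  # Orange: sweet (chosen) vs. round
    ("E+/C", "C"),   # Orange: sweet vs. fruit (chosen)
    ("E-/D", "E-"),  # Snake: dangerous (chosen) vs. long
    ("E-/C", "E-"),  # Snake: dangerous (chosen) vs. reptile
    ("C/D", "C"),    # Snake: long vs. reptile (chosen)
]
print(score_scales(example))
```

Because each item belongs to exactly one subscale, the three scale scores computed this way share no items, which is the sense in which the scales are experimentally independent.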




Development of the Binary-choice Measure 


Construction of items. In the construction of items, no attempt
was made to “sample” stimulus words and response words, as one
might have done by randomly selecting words from the dictionary.
Rather, we simply “thought up” stimulus words that might reasonably
be expected to elicit two or more of our four types of semantic
responses. All stimulus words and response words appear to be
sufficiently simple that word-difficulty should pose no major problem
for people with a grade-school education. In support of this claim
we have found reliable results for young adolescents.

Items were composed by four people and were included in tryout
forms only if all four agreed on the classifications of response words.
Although it would have been possible to include the same stimulus
word in three subscales (E-plus and E-minus responses not being
contrasted in any items), in only two cases does the same stimulus
word appear three times in the instrument, and only 26 stimulus
words appear twice.

Reliability. To standardize the instrument and to explore empirical
correlates of the instrument, four major and over twenty minor
investigations were made. Over 3000 subjects participated in the
studies, including small children, elementary and high-school
students, college students, and members of the Armed Forces.

In our first exploration of semantic habits, about half of the items 
in the Appendix were administered to 70 college students. We wanted 
to see if the proposed scales had any reliability at all. Split-half 
reliability correlations were made for each subscale. Corrected by 
the Spearman-Brown prophecy formula, the reliabilities averaged 
in the .50’s. Although, in comparison to what one would expect for a 
polished instrument, those reliabilities were low, we were happy to 
find even those small measures of internal consistency, and, conse- 
quently, we were motivated to develop better forms. The number of 
items was approximately doubled, and the augmented scales were 
applied in a number of studies. In this article we will rely mainly 
on the results obtained from testing the entire freshman class at 
Vanderbilt (N of 822) in the fall of 1961. 
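
The split-half procedure with the Spearman-Brown correction can be sketched as follows. The odd-even split and the 0/1 item coding are assumptions made for illustration; the article does not say how the halves were formed.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r, k=2.0):
    """Predicted reliability when test length is multiplied by k."""
    return k * r / (1 + (k - 1) * r)


def split_half_reliability(item_matrix):
    """item_matrix: one row per person, one 0/1 score per item.

    Halves are formed from alternating items (an assumed split), the
    half scores are correlated, and the result is stepped up to full
    length with the Spearman-Brown formula.
    """
    first_half = [sum(row[0::2]) for row in item_matrix]
    second_half = [sum(row[1::2]) for row in item_matrix]
    return spearman_brown(pearson_r(first_half, second_half))


# Tiny made-up data set, only to show the call.
toy = [
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
print(round(split_half_reliability(toy), 3))
```

The same `spearman_brown` function with `k=2.0` also indicates what doubling the number of items, as was done here, would be expected to do to a scale's reliability.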

Table 1 shows the intercorrelations of the five subscales and the
total E-plus and total E-minus scales. (The reader is reminded that
in our exploration of the empirical correlates of semantic habits we
generally confine ourselves to three scales: total E-plus, total
E-minus, and the C-D balance scale.) In the diagonal spaces of the
matrix are shown the Kuder-Richardson formula 20 reliability
coefficients (Guilford, 1954) of each scale and subscale. The table
shows that the three major scales all have reliabilities of
approximately .80.

TABLE 1
Intercorrelations of Semantic Habits
(KR-20 Reliabilities in Diagonals)

               E+/C     E+/D    E-/C    E-/D    C-D      ΣE+     ΣE-
                                                Balance
E+/C           (.77)a    .60     .63     .42    -.54
               (.74)b    .55     .47     .34    -.46
E+/D                    (.66)    .28     .29    -.07
                        (.62)    .20     .18    -.04
E-/C                            (.75)    .55    -.63
E-/D                                    (.53)
C-D Balance                                     (.79)   -.35    -.56
ΣE+                                                     (.82)
ΣE-                                                             (.78)

a Upper values denote correlations for males.
b Lower values denote correlations for females.
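
For readers who want to see what a Kuder-Richardson formula 20 coefficient involves, here is a minimal sketch. It assumes each item is scored 1 when the keyed alternative is chosen and 0 otherwise, and it uses the population variance of total scores; both are conventional choices, not details taken from the article.

```python
from statistics import pvariance


def kr20(item_matrix):
    """Kuder-Richardson formula 20 for dichotomous items.

    item_matrix: one row per person, one 0/1 score per item.
    KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores).
    """
    n_persons = len(item_matrix)
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    var_total = pvariance(totals)  # population variance (divide by N)
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n_persons
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Made-up responses for four people on three items.
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(toy))
```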

One interesting fact shown in Table 1 is that E-plus versus C and
E-minus versus C have higher reliabilities than E-plus versus D and
E-minus versus D. The reason for this is that E and D responses
are “competitive.” That is, people who tend to give many E
responses also tend to give many D responses on the C-D scale.
Consequently, when E and D responses are directly compared, the
variance of responses is restricted.

Correlations among scales. The correlations among the scales are
all in the anticipated directions. That is, E-plus versus D correlates
positively with E-plus versus C, and E-minus versus D correlates
positively with E-minus versus C, and so on. An important point is
that total E-plus correlates positively with total E-minus (.52 for
males, .39 for females), the meaning of which will be explored in a
later section.

Item analysis. In order to further develop the instrument, we 
computed (a) the proportion of persons choosing each of the two 
alternative responses to each item, and (b) the correlation of each 
item with its respective subscale. This information is shown in the 
Appendix for each item. It should be made clear that the correlations 
are with respect to subscales rather than full scales. That is, the
correlations listed in the E-plus versus C section of the Appendix show
relations between the 24 items on that subscale and total scores for
the 24 items.

Although the purpose of the item analysis was to either delete or
modify items that failed to “go along” with their intended subscales,
the results indicate that very little of that needs to be done. The
reason is that all 143 items correlate positively with their intended
subscales, even after correction for item-total overlap (Guilford,
1954).
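
The two item statistics described above can be sketched as follows. To remove item-total overlap, this sketch correlates each item with the subscale score computed from the remaining items (the item-remainder correlation); whether that matches the exact correction formula the authors took from Guilford is not stated here, so treat it as a stand-in. Items are assumed to be coded 1 for the keyed alternative and 0 otherwise.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_analysis(item_matrix):
    """Return (proportion keyed, item-remainder correlation) per item.

    item_matrix: one row per person, one 0/1 score per item of a
    single subscale.
    """
    n_persons = len(item_matrix)
    totals = [sum(row) for row in item_matrix]
    results = []
    for j in range(len(item_matrix[0])):
        item = [row[j] for row in item_matrix]
        remainder = [t - i for t, i in zip(totals, item)]  # total without item j
        proportion = sum(item) / n_persons
        results.append((proportion, pearson_r(item, remainder)))
    return results


# Made-up responses for four people on a three-item subscale.
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
for proportion, r in item_analysis(toy):
    print(f"{proportion:.2f} {r:.2f}")
```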

Norms. Means and standard deviations for the three major scales
are shown in Table 2. Separate figures are shown for males and
females. As is evidenced in the table, in all of our studies we find
relatively small but consistent sex differences. Males tend to give
more C responses, and, in turn, females tend to give more E-plus
and E-minus responses.

TABLE 2
Sex Differences in Semantic Habits

                     Females      Males
                     (N = 242)    (N = 580)
C-D Balance
   Mean               18.50        20.66
   S.D.                5.07         5.94
Total E-plus
   Mean               27.14        25.04
   S.D.                7.25         7.88
Total E-minus
   Mean               22.64        21.24
   S.D.                6.88         6.51
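
One way to gauge how small these sex differences are is to express each mean difference in pooled standard deviation units. The sketch below does this from the Table 2 figures. The standardized difference is not a statistic the authors report, and treating the tabled S.D.s as sample standard deviations is an assumption.

```python
from math import sqrt


def standardized_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Mean difference (a minus b) in pooled standard deviation units."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / sqrt(pooled_var)


N_MALES, N_FEMALES = 580, 242

# Values copied from Table 2: (male mean, male S.D., female mean, female S.D.).
TABLE_2 = {
    "C-D balance": (20.66, 5.94, 18.50, 5.07),
    "Total E-plus": (25.04, 7.88, 27.14, 7.25),
    "Total E-minus": (21.24, 6.51, 22.64, 6.88),
}

# Positive values mean males score higher; negative values mean females do.
effects = {
    scale: standardized_difference(mm, ms, N_MALES, fm, fs, N_FEMALES)
    for scale, (mm, ms, fm, fs) in TABLE_2.items()
}
for scale, d in effects.items():
    print(f"{scale}: {d:+.2f}")
```

On these figures each difference is well under half a pooled standard deviation, which fits the description of the differences as relatively small.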

Response sets. Because of the accumulated evidence which shows 
that many paper-and-pencil measures are strongly influenced by 
response sets, it is important to demonstrate that a new instrument is 
relatively free of such artifactual sources of variance. By its nature, 
our binary-choice form should be free of two types of response sets: 
acquiescence and the tendency to mark extremes. Acquiescence 
should not be present because there is nothing with which the subject 
is asked to agree or disagree. Extremeness should not be present
because the binary-choice format permits no extreme answers, as
would be the case, for example, with multi-point rating scales. Just
to make sure that individual differences in these two response sets
do not correlate with our semantic habits scales, correlations were
obtained with nine measures of acquiescence and four measures of
the tendency to mark extremes. As the results in Table 3 indicate,
if there are any “real” correlations between the two response sets 
and our measures of semantic habits, they apparently are small. 

In addition to worrying about other types of response sets, we
were concerned about the extent to which subjects might be “faking” 
responses to our instrument. In this connection two points are worth 
mentioning. In order for subjects to “fake” responses, they must be 
able to formulate hypotheses about the implications of different 
types of answers, and to do this they must be able to differentially 




categorize responses within the instrument. Although no thorough 
investigation has been made of “awareness,” informal questioning 
of subjects suggests that they have little idea what is being meas- 
ured. Even fellow psychologists who inspect the instrument for the 
first time do not guess that we are measuring individual differences 
in the tendency to use four different types of associations. This 
probably is due to the random ordering of items within the instru- 
ment. 

If, in any sense, subjects “fake” our instrument, we can think of 
only one way in which that might occur: some subjects might use 
only "nice sounding" response words like pretty, sweet, and valua- 
ble. This would be much like the “social desirability” response set 
that has received so much attention of late. If such a response set 
influenced responses on our instrument, it should force a substantial 
negative correlation between total E-plus and total E-minus scores. 


TABLE 3
Semantic Habits and Measures of Acquiescence and Extremeness

                                                    E+      E-      C-D
                                                                  Balance
Acquiescence
  German Language Recognition Form (N = 70)         .07    -.07     .06
    (Nunnally & Husek, 1958)
  Agreement Response Scale (N = 70)                -.04
    (Couch & Keniston, 1960)
  Social Acquiescence Scale (N = 70)                        .18
    (Bass, 1956)
  California F-scale (N = 70)
    (Adorno, et al., 1950)
  Reversed California F-scale (N = 70)
    (Christie, et al., 1958)
  F-scale minus Reversed F-scale (N = 70)           .08    -.28
    (McGee, 1961)
  Extra-sensory Perception Task (N = 70)            .08    -.18    -.19
    (Husek, 1958)
  Foreign Language Test (N = 294)
    Judgments
    Preferences
Extremeness
  Foreign Language Test (N = 294)*
    Judgments
    Preferences                                     .00
  Semantic Differential, “Future” (N = 294)*        .08             .00
  Semantic Differential, “Me” (N = 294)*           -.08

* Discussed in detail in Stuart (1962).




In fact, as was mentioned previously, a substantial positive
correlation exists between total E-plus and total E-minus scores on
our instrument, which makes it quite doubtful that this form of
“social desirability” has an influence.

It is difficult for us to conceive of other response sets that might
influence the results of our instrument. Of course, individuals may
adopt their own idiosyncratic ways of responding to the instrument,
e.g., more frequently mark the first of the two alternative responses;
but these should serve only to lower the internal consistency, which
we know to be reasonably high.


Other Measures of Semantic Habits 


Of course, we would have no further interest in our binary-choice
measure of semantic habits unless we could show strong correlations
with other ways to measure the same attributes. First, we
want to show correlations with other paper-and-pencil measures of
semantic habits, and then we hope to find correlations with real-life
indicators of semantic habits. We have accomplished the first, and
we are beginning to explore the second.

Our binary-choice measure has been compared with the results
obtained from two other methods for measuring semantic habits.

The first alternative instrument is a multiple-choice form in which
E-plus, E-minus, C, D, and one other type of response is made 
available for each stimulus word. All stimulus words and over 
per cent of the response alternatives were chosen from the Kent- 
Rosanoff Minnesota Norms (Russell & Jenkins, 1954). For each 
item, response words were chosen such that they had roughly the 
same frequency of association with the stimulus word. Following are
examples of the items:

     Priest:  good (E-plus)
              sin (E-minus)
              robe (D)
              minister (C)
              prayer (F)

     Cheese:  eat (F)
              good (E-plus)
              smell (E-minus)
              yellow (D)
              food (C)

     Doctor:  help (F)
              man (C)
              health (E-plus)
              sickness (E-minus)
              white (D)

Instructions for subjects were very much the same as those used in 
the classical method of association, except in this case subjects were 
limited to marking one of five alternative responses. The response
words above marked “F” relate to “functions,” which will be
discussed in a later section.

The second alternative method for measuring semantic habits was 
a free-response form. Every other stimulus word was taken from 
the subscales of the binary-choice form. Subjects were asked to 
complete a sentence with each stimulus word as follows: 

A baseball is 

A ghost is

Coal is 

An ant is 
Subjects were told that they could either respond with one word only 
or could preface the final word with “a,” “an,” or “the.” This type of 
item format was used in preference to that employed in the classical 
method of association to induce subjects to give semantic responses, 
i.e., those that serve to define, denote, and describe, rather than the 
syntactic responses typically found with the classical method of as- 
sociation. (For a further discussion of the important difference be- 
tween semantic and syntactic association see Nunnally & Flaugher, 
1963b.) The free-response form is content-analyzed¹ with respect
to E-plus, E-minus, C, D, and other types of responses. 

The three methods of measuring semantic habits discussed above
were administered to a group of 95 college students.² Table 4 shows
the correlations of the binary-choice form with the other two forms,
separately for males and females. Because the two alternative
methods had separate measures for C and D, in each comparison two
correlations are shown between the C-D scale from the binary-choice
form and corresponding scales on the other two instruments.

TABLE 4
Correlations of the Corresponding Scales in Three
Methods of Measuring Semantic Habits

             Binary-choice Form with     Binary-choice Form with
             Multiple-choice Form        Free-response Form
             Males       Females         Males       Females
             (N = 62)    (N = 33)        (N = 62)    (N = 33)
E-plus        .52         .51             .58         .43
E-minus       .44         .30             .47         .61
C             .72         .55             .49         .63
D            -.57        -.30            -.96        -.47

¹ In the study of the free-response form discussed in this article, content
analyses were made by two coders working independently. Correlations for the
codings of the two workers for our four primary categories of semantic habits
averaged over .90. Generally, we find quite high agreement between coders in
content analyses of semantic habits exhibited in various types of free-response
written and oral productions.

² For this study, instead of the 143-item form, a shortened version (71 items)
of the binary-choice measure was used. The remaining 72 stimulus words were
used to comprise the free-response form.

All of the correlations are in the expected directions. The average
correlation between corresponding scales is .50.³ Several points should
be kept in mind in looking at the correlations. First, the
binary-choice form was shortened by one-half. Consequently, the
reliabilities for the scales would be considerably lower than those
shown in Table 1. Second, less than one-third of the stimulus words
were the same on the binary-choice and free-response forms, and just
two of the stimulus words from the multiple-choice form appeared in
either of the other two tests. Third, the two alternative methods of
measuring semantic habits were developed especially for this study, and
no previous tryout research had been undertaken to refine them.
Considering these points, we view the correlations in Table 4 as
convincing evidence of the generality of semantic habits across
alternative methods of measurement.
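
As a rough worked example of the first point (an illustration, not a figure reported in the article), the Spearman-Brown formula with length factor k = 1/2, applied to a full-length reliability of about .80, gives the reliability one might expect for a half-length scale:

```latex
\rho_{k} = \frac{k\,\rho}{1 + (k - 1)\,\rho},
\qquad
\rho_{1/2} = \frac{0.5 \times 0.80}{1 - 0.5 \times 0.80}
           = \frac{0.40}{0.60} \approx 0.67
```

This assumes the retained items are comparable to the ones dropped, which the article does not state.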


Other Semantic Habits 


Of course, there may be many other semantic habits, perhaps ones
that are more important than those that we are currently studying.
One type of semantic habit was forced on us by a study of
nursery-school children. Rather than evaluate stimulus objects, or
give C or D responses, they predominantly responded in terms of
“functions,” that is, with what “it” does or what you do with “it.”
Examples are Mother—cooks, Dog—bark, Knife—hurt, and Spoon—eat.

³ Of course, since the C and D scales are opposed in the binary-choice form,
a low “C-D balance” score represents a high D score, so C-D balance correlates
negatively with D scores of the other measures. This sign was changed in
computing the average correlation.

There may be numerous forms of “relational” semantic habits in 
which, rather than responding to the properties of the stimulus ob- 
ject, the subject responds in terms of related objects or ideas. Such 
relational responses might be represented in terms of temporal, 
spatial, and functional relations. Examples of the three are, respec- 
tively, June—July, Chair—table, and Doctor—nurse. Further rela- 
tional semantic habits may exist in terms of the tendency to give 
“similarities” rather than “opposites,” respective examples being
Illness—sickness and Black—white. The number of potentially
measurable and potentially useful semantic habits may be quite
large. We are presently developing measures of several new types of 
semantic habits. 


Conclusions 


Our major accomplishment to date has been the development and 
refinement of a binary-choice measure of semantic habits. Our re- 
sults show (a) the three major scales on the instrument have reason- 
ably high internal consistency, (b) correlations among subscales are 
in the expected direction, and (c) the scales on the binary-choice 
measure correlate well with two alternative methods of measurement.
Other results (Nunnally & Flaugher, 1963a) show that the
instrument distinguishes between some different types of people, and 
that the scales have small but consistent correlations with per- 
sonality inventories. 

An important type of information about our binary-choice form 
still is lacking. So far we have conducted no studies of the temporal 
stability of the scales. A large-scale follow-up study will be made
during the next several months of students who were tested fourteen 
months previously. 

Now that we have developed measures of semantic habits, are they 
useful for any purpose? If our working assumption is correct that 
semantic habits represent different modes of frequency of usage, 
many hypotheses can be generated linking semantic habits with 
cognitive and affective processes (see Nunnally & Flaugher, 1963b). 
For example, the hypothesis follows that visual duration thresholds 
for “pleasant” and "unpleasant" words should interact with the 
E-plus and E-minus scales on our instrument. Many more such 




hypotheses can be generated. Readers are invited to join us in this 
challenging new area of research. 


Summary 


The purpose of the article was to describe the construction and 
refinement of measures of new sources of individual differences 
which we refer to as semantic habits. By semantic habits are meant 
individual differences in the use of different modes of semantic
response to objects in the human and material environment. For
example, one of our scales measures the tendency to give “pleasant”
response words such as good, pretty, and sweet. 

Our major instrument for measuring semantic habits employs 
binary-choice association items, e.g., Orange: sweet ____ fruit ____.
The alternatives are structured in such a way as to measure
three major types of semantic habits. Our results indicate that the
scales have a relatively high degree of internal consistency and that
they correlate well with two alternative methods for measuring 
semantic habits. Other psychometric properties of the instrument 
were reported. 

Our working assumption is that semantic habits represent dif- 
ferent modes of frequency of usage. If that is so, many hypotheses 
follow about relations between semantic habits and verbal learning, 
verbal performance, perception, and personality. 


REFERENCES 


Adorno, T. W., Frenkel-Brunswik, Else, Levinson, D. J., and Sanford,
R. N. The Authoritarian Personality. New York: Harper and
Brothers, 1950.

Bass, B. M. “Development and Evaluation of a Scale for Measuring
Social Acquiescence.” Journal of Abnormal and Social Psychology,
LIII (1956), 296-299.

Christie, R., Havel, J., and Seidenberg, B. “Is the F-scale
Irreversible?” Journal of Abnormal and Social Psychology, LVI
(1958), 143-159.

Couch, A. and Keniston, K. “Yeasayers and Naysayers: Agreeing
Response Set as a Personality Variable.” Journal of Abnormal
and Social Psychology, LX (1960), 151-174.

Guilford, J. P. Psychometric Methods. New York: McGraw-Hill,
1954.

Husek, T. R. “Acquiescence as a Factor in Test-taking Behavior
and as a Personality Characteristic.” Unpublished Ph.D. thesis,
University of Illinois, 1958.

McGee, R. K. “The Relationship Between Response Style and
Personality Variables: Acquiescence and Social Orientation.”
Unpublished Ph.D. thesis, Vanderbilt University, 1961.

Nunnally, J. C. and Flaugher, R. L. “Correlates of Semantic
Habits.” Journal of Personality, XXXI (1963), 192-202. (a)

Nunnally, J. C. and Flaugher, R. L. “Psychological Implications of
Word Usage.” Science, CXL (1963), 775-781. (b)

Nunnally, J. C. and Husek, T. R. “The Phony Language Examination:
An Approach to the Measurement of Response Bias.” Educational
and Psychological Measurement, XVIII (1958), 275-282.

Russell, W. A. and Jenkins, J. J. “The Complete Minnesota Norms
for Responses to 100 Words from the Kent-Rosanoff Word
Association Test.” Technical Report, N8-onr-66216, University of
Minnesota, 1954.

Stuart, Jane L. “Intercorrelations of Depressive Tendencies, Time
Perspective, and Cognitive Style Variables.” Unpublished Ph.D.
thesis, Vanderbilt University, 1962.


Appendix

Following are listed the stimulus words and response alternatives
for the five subscales of the binary-choice measure of semantic
habits. Also shown are (a) the correlation of each item with its
corresponding subscale, and (b) the per cent of subjects (N = 822)
selecting each of the two alternative choices for each item.


E-plus versus C Subscale

Stimulus word    E-plus alternative    C alternative        Item-subscale
                 (per cent)            (per cent)           correlation
football:        61 exciting           39 game                 .55
steak:           53 tasty              47 meat                 .52
peacock:         66 beautiful          34 bird                 .50
eagle:           42 brave              58 bird                 .50
racer:           53 thrilling          47 car                  .48
chair:           40 comfortable        60 furniture            .47
ballerina:       38 graceful           62 dancer               .46
silk:            44 beautiful          56 cloth                .46
orchid:          37 pretty             63 flower               .44
kitten:          58 friendly           42 animal               .42
movie:           42 enjoyable          58 entertainment        .42
aspirin:         52 helpful            48 pill
chocolate:       56 sweet              44 candy
apple:           14 sweet              86 fruit                .38
skiing:          60 thrilling          40 sport                .38
ring:            18 pretty             82 jewelry              .35
magician:        42 fascinating        58 entertainer          .34
                 14 useful             86 tool                 .32
mink:            54 expensive          46 fur                  .30
clown:           75 amusing            25 actor                .28
diamond:         45 expensive          55 jewel                .27
ivy:             14 pretty             86 vine                 .26
sleep:           42 soothing           58 rest                 .25
dictionary:      18 valuable           82 book


E-plus versus D Subscale

Stimulus word    E-plus alternative    D alternative        Item-subscale
                 (per cent)            (per cent)           correlation
                 48 satisfying         52 wet
                 31 pleasant           69 green
                 72 fun                28 wet
                 66 pretty             34 round
                 39 thrilling          61 fast
                 42 delicious          58 cold
                 59 soothing           41 quiet
                 35 pleasant           65 warm
                 51 fun                49 round
                 35 lovely             65 distant
                 18 exciting           82 fast
                 44 useful             56 small
                 53 fragrant           47 red
                 14 pretty             86 green
                 63 inspiring          37 colorful
                 63 cute               37 small
                 40 sweet              60 red
                 54 refreshing         46 clean
                 34 good               66 juicy
                 74 comfortable        26 wooden
                 51 peaceful           49 sheltering
                 50 beautiful          50 shiny
                 56 fun                44 rhythmic
                 57 expensive          43 hard
                 23 friendly           77 furry
                 74 sweet              26 dark
                 44 restful            56 soft
                 28 gentle             72 woolly
                 72 tasty              28 nourishing
                 90 valuable           10 round
                 48 brave              52 strong


E-minus versus C Subscale

Stimulus word    E-minus alternative   C alternative        Item-subscale
                 (per cent)            (per cent)           correlation
                 37 dangerous          63 firearm
                 26 frightening        74 animal
                 35 repulsive          65 bird
                 55 horrible           45 crime
                 52 dirty              48 fuel
                 26 cruel              74 weapon
                 70 stinking           30 animal
                 34 dangerous          66 chemical
                 26 burning            74 medicine
                 27 uncomfortable      73 disease
                 19 repulsive          81 reptile
                 60 scary              40 dream
                 42 annoying           58 insect
                 77 sour               23 fruit
                 41 sneaky             59 agent
                 79 sad                21 ceremony
                 21 alarming           79 crowd
black:           38 dreary             62 color
knife:           35 dangerous          65 weapon
anger:           26 unpleasant         74 emotion
ambush:          12 deadly             88 trap
sobbing:         32 unhappy            68 crying
siren:           76 alarm              24 signal
cannibal:         9 awful              91 savage
wreck:           10 unsafe             90 accident
dice:             8 risky              92 gambling


E-minus versus D Subscale

Stimulus word    E-minus alternative   D alternative        Item-subscale
                 (per cent)            (per cent)           correlation
knife:           69 dangerous          31 thin
cliff:           41 dangerous          59 high
snake:           51 ugly               49 long
ghost:           44 scary              56 white
dust:            51 unpleasant         49 dry
fly:             87 annoying           13 small
bear:            47 mean               53 furry
rain:            31 dreary             69 wet
lunatic:         45 dangerous          55 sick
coal:            46 dirty              54 black
lion:            64 ferocious          36 strong
alarm clock:     61 disturbing         39 punctual
drunk:           38 disgusting         62 intoxicated
spy:             44 sneaky             56 secretive
dictator:        39 unfair             61 powerful
crime:           21 horrible           79 illegal
hunger:          42 unpleasant         58 empty
coward:          59 worthless          41 uncertain
noise:           35 unpleasant         65 loud
mud:             92 messy               8 wet
fever:           40 uncomfortable      60 hot
pepper:          80 hot                20 black
garbage:         38 foul               62 rotten


C versus D Subscale

Stimulus word    D alternative         C alternative        Item-subscale
                 (per cent)            (per cent)           correlation
hammer:          46 hard               54 instrument
gun:             34 loud               66 firearm
silk:            56 soft               44 material
butter:          58 soft               42 food
jet:             62 fast               38 airplane
axe:             35 sharp              65 tool
noise:           57 loud               43 sound
mittens:         67 warm               33 gloves
crow:            41 black              59 bird
marble:          67 smooth             33 rock
diamond:         56 brilliant          44 jewel
lamb:            65 woolly             35 sheep
sea:             52 deep               48 ocean
black:           62 dark               38 color
cat:             47 furry              53 animal



EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
Vol. XXIII, No. 3, 1963


A FACTOR ANALYTIC STUDY OF ACQUIESCENT AND 
EXTREME RESPONSE SET¹


RICHARD E. SCHUTZ 
Arizona State University 
AND 
ROBERT J. FOSTER 


The George Washington University 


IN recent years, there has been considerable interest in treating
the response set component of test scores, not as error variance, but
as an expression of a personal stylistic variable (Couch & Keniston,
1960; Jackson & Messick, 1958; McGee, 1962a). For the most part,
studies which have sought to demonstrate a relationship between the
response set and an underlying dimension of personality have used
only a single test or instrument to elicit the response set. Chief
interest has been centered around the tendency to respond affirmatively
to test items, commonly referred to as acquiescent response set
(ARS). Consequently several procedures purporting to measure
ARS are available. However, more recent correlational research
(Foster, 1961; Hanley, 1959; McGee, 1962a) indicates that these
measures do not tap a homogeneous response class, suggesting that
ARS probably should not be considered as a generalized tendency.

A factor analysis of social desirability, defensiveness, lie, and 
acquiescence scales performed by Bendig (1962) yielded five oblique 
factors with the two ARS scales loading only one factor, labeled 
“test-taking acquiescence.” However, since the two ARS measures 
included in this analysis were both subscales of a single scale, the 
Social Acquiescence Scale (Bass, 1956), the parallel factor pattern 


¹The data analysis aspects of this study were supported by the National
Institute of Mental Health under Contract M-2381. The assistance of Austin E.
Grigg, University of Texas, in gathering part of the data is gratefully
acknowledged.




for the two variables is to be expected. Although these findings in- 
dicate that ARS measures have a different factor pattern from other 
response set measures, the results shed no light on the factorial 
structure of ARS per se. 

The present study was designed to investigate the factorial
structure of a larger number of response set measures when factored with
several “marker” variables. Of particular concern was the structure
obtained using “strongly agree” (SA), “moderately agree” (MA),
“acquiescent response set” (SA + MA), and “extreme response set”
(strongly agree + strongly disagree) scoring procedures. Since these
measures are all ipsative, four separate analyses were performed.
The type of response set scores included in each analysis differed, 
but all other variables remained the same. Since the marker vari- 
ables are the same in each analysis, differences in the factor patterns 
can be attributed to differences in the factorial composition of the 
response set scores. 
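
The four scoring procedures can be illustrated with a minimal sketch. The response codes and the function name below are hypothetical and are not taken from the study; the sketch assumes a four-alternative agree/disagree format such as that of the Aphorisms Questionnaire.

```python
from collections import Counter

# Hypothetical response codes for a four-alternative item (an assumption).
SA, MA, MD, SD = "SA", "MA", "MD", "SD"


def response_set_scores(responses):
    """Return count-based response set scores for one subject on one instrument."""
    counts = Counter(responses)
    return {
        "SA": counts[SA],                 # "strongly agree" responses
        "MA": counts[MA],                 # "moderately agree" responses
        "ARS": counts[SA] + counts[MA],   # acquiescent response set
        "ERS": counts[SA] + counts[SD],   # extreme response set
    }


answers = [SA, MA, MA, SD, MD, SA, MA, SD]
scores = response_set_scores(answers)
```

Because the four category counts always sum to the number of items, scores built from them are linearly dependent, which is why the ipsative measures are entered in separate analyses rather than in one.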


Method 
Response Set Instruments 


1. The Aphorisms Questionnaire. 72 aphorisms including many
found in the Bass (1956) scale. Items as a whole are highly
heterogeneous in content, although all concern human behavior.
The scale was constructed so that ARS content would be
counterbalanced for social desirability. Response alternatives are
“strongly agree,” “moderately agree,” “moderately disagree,”
and “strongly disagree.”

2. The Perceptual Reaction Test (PRT). A measure developed by
I. A. Berg, W. A. Hunt, and E. H. Barnes (Berg, 1953). The
test includes 60 abstract designs. Response alternatives are “like
much,” “like slightly,” “dislike slightly,” and “dislike much.”

3. Activity Scale. 50 phrases describing a wide range of activities
and interests in which college students are likely to engage.
Items likely to elicit strong social desirability bias were excluded.
Response alternatives are “like strongly,” “like moderately,”
“indifferent,” “dislike moderately,” and “dislike strongly.”

4. Information-True Test. A slight modification of the measure
developed by N. L. Gage, G. S. Leavitt, and G. C. Stone (1957).
The test includes 46 extremely difficult true-false items found to
elicit a fifty-fifty split of “true” and “false” responses. These
are intermixed with 32 easier buffer items to make the task
appear realistic.


Self-report Marker Variables 


Subjects were asked to choose which of two adjective phrases best 
describe them. The forced-choices were read by the experimenter, 
and subjects marked A or B on answer sheets. 

5. Friendly vs. ambitious 
6. Critical vs. easy going 
7. Need to be around people vs. enjoy being alone frequently 
8. Accurate vs. careless 
9. Less optimistic than average vs. more optimistic than average
10. Like to work with people vs. like to work with things
11. Likable and agreeable vs. argumentative and firm
12. Helpful vs. independent
13. Go-getter vs. easy going
14. More studious than average vs. less studious than average 
15. Talkative vs. silent 
16. Independent vs. social 


Other Variables 


17. Number of siblings 

18. California F Scale (F+). This test was composed of 25 of the 
original items which were used in a study by Christie, Havel, 
and Seidenburg (1958). 

19. Compliance—Essay Writing. As subjects were leaving the testing
session, they were handed a sheet of paper requesting that
they write a 50 to 100 word essay on “aspects of life which are
most important to me” and return it to a conveniently located
box within three days. Those who turned in an essay received
a score of 1, those who did not received a score of 0.

20. Compliance—Volunteer. The response to a request made by the
experimenter at the beginning of the ARS testing session. Subjects
were asked to volunteer to do another 10 to 15 minute task
in about three weeks to help out the experimenter and a
co-worker. To indicate their willingness to help, subjects were
instructed to write “yes” by their name on a card (scored 1) or
to write “no” (scored 0) if they did not wish to participate. A
request to pass the cards forward quickly prevented questions.




21. Social desirability response set. The Aphorisms Questionnaire 
contained 24 items judged to be socially undesirable and 24 
judged to be socially desirable. The score on this variable was 
the number of items a subject marked in a socially desirable
direction minus the number of items marked in a socially un- 
desirable direction. 

22. "Single alternative" response set. The frequency on the Apho- 
risms Questionnaire with which a subject marked the response 
category that he selected most frequently. A high score on this 
variable indicates a tendency to concentrate responses in a
single response category; a low score indicates a tendency to 
distribute responses among the categories. This set would ap- 
pear to reflect aspects of rigidity or inflexibility in restricting 
the variability of one’s responses. 

23. University of Texas Entrance Examination—Verbal score 

24. University of Texas Entrance Examination—Numerical score 
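
Variables 21 and 22 are both derived from the Aphorisms Questionnaire responses. A minimal sketch of the two scores follows; the response codes, the item keying, and the function names are hypothetical. In particular, the sketch assumes that "marked in a socially desirable direction" means agreeing with a desirable item or disagreeing with an undesirable one, which the text does not spell out.

```python
from collections import Counter

AGREE = {"SA", "MA"}  # hypothetical codes for the two "agree" alternatives


def social_desirability_score(responses, keyed_desirable):
    """Desirable-direction marks minus undesirable-direction marks.

    responses: dict mapping item id -> response code.
    keyed_desirable: dict mapping item id -> True if the item was judged
    socially desirable, False if judged socially undesirable.
    """
    score = 0
    for item, is_desirable in keyed_desirable.items():
        agreed = responses[item] in AGREE
        # Agreeing with a desirable item, or disagreeing with an
        # undesirable one, counts as a desirable-direction mark.
        score += 1 if agreed == is_desirable else -1
    return score


def single_alternative_score(responses):
    """Frequency of the response category the subject used most often."""
    return max(Counter(responses.values()).values())


responses = {1: "SA", 2: "MD", 3: "MA", 4: "SA", 5: "SD", 6: "SA"}
keyed = {1: True, 2: False, 3: False, 4: True}
sd_score = social_desirability_score(responses, keyed)
single_alt = single_alternative_score(responses)
```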


Subjects 


Subjects were 150 freshmen, 75 males and 75 females, who were 
selected from introductory psychology courses on the basis of age 
(17-19 years) and their being available at the time testing was 
scheduled. Subjects participated to meet course requirements. 


The Analysis 


The 24 scores for each subject were punched into IBM cards. A
product-moment intercorrelation matrix was prepared and a principal
components analysis was performed. Components with eigenvalues
greater than unity were rotated to orthogonal simple structure
using normalized varimax procedures.² All statistical
computations were performed using an IBM 7090 computer.
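
The analysis steps described above can be sketched with NumPy. This is an illustration under stated assumptions, not the original IBM 7090 program: the function names are hypothetical, the data are simulated, and the rotation uses a common SVD-based varimax iteration with Kaiser row normalization.

```python
import numpy as np


def principal_component_loadings(data):
    """Loadings of the components whose eigenvalues exceed unity."""
    corr = np.corrcoef(data, rowvar=False)       # product-moment correlations
    eigvals, eigvecs = np.linalg.eigh(corr)      # returned in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                         # eigenvalue-greater-than-unity rule
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])


def normalized_varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation with Kaiser (row) normalization."""
    norms = np.sqrt((loadings ** 2).sum(axis=1, keepdims=True))
    a = loadings / norms                         # scale each row to unit length
    p, k = a.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        b = a @ rotation
        target = b ** 3 - b * (b ** 2).sum(axis=0) / p
        u, s, vt = np.linalg.svd(a.T @ target)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return (a @ rotation) * norms                # restore the original row lengths


# Simulated scores: 150 subjects, 6 variables driven by two latent factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(150, 2))
pattern = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
data = factors @ pattern.T + 0.5 * rng.normal(size=(150, 6))

unrotated = principal_component_loadings(data)
rotated = normalized_varimax(unrotated)
```

An orthogonal rotation redistributes variance among the retained components but leaves each variable's communality (its row sum of squared loadings) unchanged, which is a convenient check on any implementation.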


Results 
Each of the four analyses yielded nine rotated factors. However, 


²These computations were performed on the IBM 7090 at the Western
Data Processing Center, Graduate School of Business Administration,
University of California, Los Angeles. We wish to thank David J. Oatey,
WDPC, [...] the analyses.

³Tables reporting the intercorrelations, means, standard deviations,
factor matrix and rotated factor matrix for each analysis have been deposited
with American Documentation Institute. Order Document No. 7561 from
ADI Auxiliary Publications Project, Photoduplication Service, Library of


for several factors in each analysis only the marker variables had 
loadings of .30. These factors will not be discussed here. All factors 
which include a response set measure with a loading above .30 are 
described below. Tables 1—4 present the scales and loadings for each 
of the factors. 


TABLE 1 
ARS (Acquiescent Response Set) Analysis 


Factor 2—Academic “Interest”
.75 More studious than average
.71 High Activity ARS
.41 High PRT ARS
.38 Accurate vs. careless
.30 More optimistic than average


Factor 3—“Inflexibility”
.84 High Single Alternative Response Set
.78 High Aphorisms ARS
.36 High Information True ARS
.34 More optimistic than average


Factor 4—Verbal Ability
.82 High Verbal Test score
.75 Low F+ score
.46 High Numerical Test score
.45 Low Information True ARS
.39 Critical vs. easy-going


Factor 5—Compliance
.62 Compliant—Volunteer
.55 Critical vs. easy-going
.43 Compliant—Essay writing
.33 Helpful vs. independent


Factor 6—Cautiousness
.69 High socially desirable response set
.68 Accurate vs. careless
.49 Low PRT
.37 Less optimistic than average


Factor 7—Social Introversion
.72 Independent vs. sociable
.71 Silent vs. talkative
.56 Compliant—Essay writing
.51 Enjoy being alone frequently
.37 Low Information True ARS
.34 Ambitious vs. friendly



Congress, Washington 25, D. C., remitting in advance $1.75 for microfilm
or $2.50 for photocopies. Make checks payable to: Chief, Photoduplication
Service, Library of Congress.




To facilitate interpretation, all loadings are reported as absolute 
values. The verbal description of the scale indicates whether high 
or low performance is associated with the factor. 


ARS Analysis. Factor 2 has been labeled Academic “Interest.” A
person earning a high factor score here represents himself as a
studious individual who enjoys a number of college-related
activities. He also “likes” abstract artistic designs. Since self-report
measures are involved here, it is possible that this “interest” is for
some individuals a facade. For this reason the word, interest, has
been bracketed in quotation marks.

Factor 3 appears to reflect a simple-minded kind of “inflexibility.”
It should be noted, however, that the inflexibility measure is a
response set type of score based on the Aphorisms Questionnaire
rather than a general measure of flexibility. It appears that this
factor may be tapping a tendency to respond unthinkingly to the
test situation, allowing the response set, whatever it may be, to be
the dominant determinant.

Factor 4 is a clear-cut verbal ability factor. The high negative
loading of the F+ scale is of incidental interest, supporting the
contention that intellectual competence contributes significantly to
the factorial components of F scale performance.

Factor 5 is quite clearly a behavioral compliance factor. The
loading of the “critical” self-report item is more understandable
when the alternate choice of “easy going” is also considered. It is
likely that a person who would score high on this factor would select
“critical” simply because he would not label himself “easy going.”
“Conscientious” might have been a better original choice.

Factor 6 has tentatively been labeled “Cautiousness.” This is [...]
Perceptual Reaction Test (PRT) loading is not immediately obvious.
However, a number of subjects incidentally indicated a belief that
the PRT involved unconscious sex or hostile symbolism. If this in- [...]
reflect cautiousness toward admitting a preference for symbolism
with socially unacceptable content.

Factor 7 is titled “Social Introversion.” The essay writing
compliance and Information True ARS loadings appear understandable
in this context.


SA Analysis. As might be expected, the SA response set measures




TABLE 2 
SA (Strongly Agree) Analysis 


Factor 2—Inflexibility 
.72 High PRT SA 
.67 High Single Alternative Response Set 
.65 High Aphorisms SA 
.43 High Activity SA 
.40 More optimistic than average 
.40 High Information True ARS 
.32 More studious than average 


Factor 3—Verbal Ability 
.80 High verbal score 
.75 Low F+ 
.52 Low Aphorisms SA 
.45 Critical vs. easy-going 
.43 High numerical score 
.41 Low Information True ARS 


Factor 6—Social Introversion 
.68 Independent vs. sociable 
.66 Compliant—Essay writing 
.65 Silent vs. talkative 
.43 Enjoy being alone frequently 
.35 Low Information True 


proved to be less complex factorially than the ARS measures, loading
only two factors. The Inflexibility factor appears to parallel Factor
3 in the ARS analysis. However, all of the paper-and-pencil ARS
measures have sizable loadings on this factor in the SA analysis.

Factor 3 also parallels the “Verbal Ability” factor in the ARS
analysis. It does, however, pick up a negative loading on the
Aphorisms Questionnaire, suggesting that the brighter subject tends to
question the complete infallibility of any aphorism statement.

The “Social Introversion” factor appears to be a direct counterpart
to the factor assigned this label in the ARS analysis.


MA Analysis. The MA measures appear on more factors than any
of the other three types of response set measures analyzed. If Factor
1 were reflected, it would closely resemble the Social Introversion
factors in the other analyses. Factor 2 is an untitled puzzle. Several
facetious interpretations readily come to mind. However, none of
these has any real plausibility.

Factor 3 contains the same variables as the factors titled
“Inflexibility” in the preceding analyses. Note, however, that on this
factor the direction of the “flexibility” loading is just opposite that

found in the preceding analyses of SA responses. It appears that the
“moderately agree” response set is related to a tendency to spread
one's responses over several response categories, both on the
Aphorisms Questionnaire, from which the flexibility measure was
derived, and on the PRT.


TABLE 3
MA (Moderately Agree) Analysis
(loadings not legible in source except where shown)


Factor 1
Ambitious vs. friendly
Talkative vs. silent
High social desirability response set
.31 Critical vs. easy-going


Factor 2—(Untitled)
High Activity MA
Many siblings
High Numerical score
Need to be around people
Silent vs. talkative


Factor 3—Flexibility
High Aphorisms MA
Low Single Alternative Response Set
High PRT MA
Low social desirability response set


Factor 4—Achievement Need
Go-getter vs. easy-going
High PRT MA
Low Information True


Factor 5—Verbal Ability
High Verbal score
Low F+
Critical vs. easy-going
High numerical score
Low Information True ARS


Factor 7—Social Introversion
Independent vs. sociable
Silent vs. talkative
Compliant—Essay writing
Enjoy being alone frequently
Low Information True ARS


Factor 9—Compliance
Compliant—Volunteer
Compliant—Essay writing

Factor 4 has no parallel in the analyses previously discussed. The
two marker variables clearly reflect an achievement concern. Factors
5, 7, and 9 are much the same as their counterparts in preceding
analyses.

ERS Analysis. When extreme response set measures are analyzed,
the verbal ability factor shades into authoritarianism. The same
variables are involved in this factor as in SA, but their order is
shifted. It should also be noted that the direction of the Information
True ARS and the “easy-going vs. critical” self report are reversed.
This would appear to indicate that a degree of “gullibility” is
associated with authoritarianism and with having strong opinions

TABLE 4
ERS (Extreme Response Set) Analysis


Factor 1—Authoritarian
.80 High F+
.79 High Aphorisms ERS
.58 Low Verbal test score
.56 High Information True ARS
.37 Easy-going vs. critical


Factor 2—Social Introversion
.72 Independent vs. sociable
.70 Silent vs. talkative
.59 Compliant—Essay writing
.46 Enjoy being alone frequently
.35 Low Information True ARS
.32 Ambitious vs. friendly


Factor 3—Inflexibility
.80 High Single Alternative Response Set
.62 High PRT ERS
.57 High Activity ERS
.33 More optimistic than average


Factor 5—(Untitled)
.69 High Numerical Test score
.55 Independent vs. helpful
.47 Low Activity ERS
.45 Many siblings
.36 More optimistic than average


Factor 9—Compliance
.80 Compliant—Volunteer
.52 Critical vs. easy-going
.40 Compliant—Essay writing




(agreeing or disagreeing) about aphorism statements. The remain- 
ing factors each parallel factors previously discussed. 


Discussion 


Considering each response set measure individually, ARS as
derived from the Information-True test appears to be the most
complex factorially, with sizable loadings on several factors in each
analysis. The Aphorisms Questionnaire and the Activity Scale load
no more than two factors in any analysis, indicating rather high
factorial purity.

While some of the factors include more than one response set
instrument, the over-all factor patterns for the several instruments
differ within each analysis. Thus, although the ARS measures are
positively correlated, the findings suggest that they do not tap a
highly generalized response tendency. These results are supported
by previous findings, suggesting that studies which use only one
measure of a given response set should not generalize beyond the
particular eliciting instrument. Likewise, the hypothesis that a given
response set, per se, may reflect a stylistic component of personality
appears to be oversimplified.

Moreover, the nature of the patterns between analyses changes
for each instrument, indicating that scoring procedures affect the
nature of the factor structure. Consequently, it appears that the
response set component of a Likert-like response continuum is rather
complex. The data suggest, for example, that MA and SA response
tendencies are partially dissimilar or, perhaps, even opposite in their
underlying “dynamics.” A similar conclusion is suggested by the
data of Couch and Keniston (1960, p. 153), which show a greater
relationship between their over-all ARS measure and “agree”
responses than “strongly agree” responses.

If this conclusion is correct, studies combining possible response
categories into one response set score may confound the results of
studies which seek to demonstrate a relationship between response
style and personality. Moreover, the findings of this study might
have been different had a six-alternative scale rather than a
four-alternative scale been used. Until more evidence is available,
response categories should be scored and analyzed separately.

The factor structure of the ARS instruments is summarized in 
Table 5. At some points the factorial structure is consistent with 


TABLE 5
Factor Loadings of Response Set Instruments
(table printed sideways, instruments by scoring procedure; entries not legible in source)


previous correlational findings in the literature (Foster, 1960;
McGee, 1962a); at other points it is inconsistent. For example, the high
ARS scorer has been found to be outgoing, uncritical, unquestioning
and insensitive or “Babbitlike,” findings consistent with the loadings
of some of the instruments on Factors 3, 6, and 7 in the ARS
analysis and Factor 1 in the MA analysis. However, the failure of any
ARS measure to load on the Compliance factor and the direction of
the PRT loading on Need Achievement in the MA analysis are
inconsistent with other correlational findings.

The loading on the Authoritarian and Inflexibility factors in the
ERS analysis supports the contention that the authoritarian
personality holds strong, rigid opinions, but it is inconsistent with the
position that ARS and the content score of the California F scale
were fortunately confounded because they both reflect the same
acquiescent tendencies as suggested by Gage, Leavitt, and Stone
(1957) and Leavitt, Hax, and Roche (1955).


Summary 


The study was designed to investigate the factorial structure of a
number of test response set measures when factored with several
“marker” variables. Scores of four response set instruments and
twenty marker measures were obtained for 150 college students.
Four separate analyses were performed, varying the scoring
procedures for measuring the response set and using the same marker
variables. The factor patterns of the various response set
instruments differed within each analysis. The nature of the patterns also
changed between analyses. The findings substantiate the results of
previous correlational studies that acquiescent response set and
extreme response set do not reflect generalized response tendencies
and that strong agree set and moderate agree set on a given test,
rather than both measuring the same dimension, are probably quite
different in their underlying meanings.


REFERENCES 


Bass, B. M. “Development and Evaluation of a Scale for Measuring
Social Acquiescence.” Journal of Abnormal and Social
Psychology, LIII (1956), 296-299.

Bendig, A. W. “A Factor Analysis of ‘Social Desirability,’
‘Defensiveness,’ ‘Lie,’ and ‘Acquiescence’ Scales.” Journal of General
Psychology, LXVI (1962), 129-136.

Berg, I. A. “The Reliability of Extreme Position Response Sets in
Two Tests.” Journal of Psychology, XXXVI (1953), 3-9.

Christie, R., Havel, Joan, and Seidenburg, B. “Is the F Scale
Reversible?” Journal of Abnormal and Social Psychology, LVI
(1958), 143-159.

Couch, A. and Keniston, K. “Yeasayers and Naysayers: Agreeing
Response Set as a Personality Variable.” Journal of Abnormal
and Social Psychology, LX (1960), 151-174.

Foster, R. J. “The Construct Validity of Acquiescent Response Set
as a Measure of Acquiescence.” Unpublished Ph.D. thesis, 1960,
University Microfilms, Inc. No. 60-6617.

Foster, R. J. “Acquiescent Response Set as a Measure of
Acquiescence.” Journal of Abnormal and Social Psychology,
(1961), 155-160.

Gage, N. L., Leavitt, G. S., and Stone, G. C. “The Psychological
Meaning of Acquiescent Set of Authoritarianism.” Journal of
Abnormal and Social Psychology, LV (1957), 98-103.

Hanley, C. “Response to the Wording of Personality Test Items.”
Journal of Consulting Psychology, XXIII (1959), 261-265.

Jackson, D. N. and Messick, S. J. “Content and Style in Personality
Assessment.” Psychological Bulletin, LV (1958), 243-258.

Leavitt, H. J., Hax, H., and Roche, J. H. “Authoritarianism and
Agreement with Things Authoritative.” Journal of Psychology,
XL (1955), 215-221.

McGee, R. K. “Response Style as a Personality Variable: By What
Criterion?” Psychological Bulletin, LIX (1962), 284-295. (a)

McGee, R. K. “The Relationship Between Response Style and
Personality Variable: I. The Measurement of Response
Acquiescence.” Journal of Abnormal and Social Psychology,
(1962), 229-233. (b)


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
Vol. XXIII, No. 3, 1963


CONVERGENT AND DISCRIMINANT VALIDITY 
OF THE CALIFORNIA PSYCHOLOGICAL INVENTORY 


CHARLES F. DICKEN¹
University of Chicago 


CAMPBELL and Fiske (1959) propose four validation criteria for
assessment techniques. The convergent criterion requires that the
correlations of two or more independent methods of measuring the same
trait (heteromethod-monotrait correlations) be significantly greater
than zero. Validation of a test score against a nontest trait measure
is an instance of convergent validation. In addition to the convergent
criterion, Campbell and Fiske stress what they term discriminant
validity. They observe that if assessment is to be efficient, measures
of presumably different traits should not correspond too closely. The
heteromethod discriminant criterion requires that the validity
(convergent) correlations of a measure exceed the correlations of that
measure with scores for different traits obtained by different
measurement methods (heteromethod-heterotrait correlations). The more
stringent monomethod discriminant criterion requires that the
convergent correlations of a measure exceed the correlations of the
measure with scores for different traits which share the same
measurement method (monomethod-heterotrait correlations). Failure in
meeting this criterion may indicate that unwanted “methods factors”
have resulted in excessively high interrelationships of scores
obtained by a common measurement procedure. Insofar as measures
of different traits are not orthogonal, the trait pattern consistency
¹The author is indebted to Donald MacKinnon, Harrison Gough, and
Wallace Hall of the Institute of Personality Assessment and Research,
University of California, Berkeley, for making available the data on which this
analysis is based.




criterion specifies that different methods of measurement yield a 
similar pattern of trait interrelationships. 
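
The two discriminant criteria amount to counting inequalities in a multitrait-multimethod matrix. The following is a minimal sketch with hypothetical names and made-up correlations for three traits. It assumes that each validity is compared with the heterotrait values in its own row and column of the heteromethod block, and with that trait's heterotrait values within each method; the exact set of comparisons tallied in the study may differ.

```python
def discriminant_tallies(hetero, mono_a, mono_b):
    """Tally the discriminant inequalities satisfied by each trait.

    hetero[i][j]: correlation of trait i measured by method A with trait j
    measured by method B (the diagonal holds the convergent validities).
    mono_a, mono_b: trait intercorrelations within method A and method B.
    """
    k = len(hetero)
    results = []
    for i in range(k):
        validity = hetero[i][i]
        others = [j for j in range(k) if j != i]
        # Heteromethod-heterotrait values sharing a row or column with the validity.
        hetero_vals = [hetero[i][j] for j in others] + [hetero[j][i] for j in others]
        # Monomethod-heterotrait values for the same trait within each method.
        mono_vals = [mono_a[i][j] for j in others] + [mono_b[i][j] for j in others]
        results.append({
            "validity": validity,
            "heteromethod_met": sum(1 for v in hetero_vals if validity > v),
            "monomethod_met": sum(1 for v in mono_vals if validity > v),
            "comparisons": len(hetero_vals),
        })
    return results


# Made-up correlations for three traits measured by two methods.
hetero = [[0.50, 0.10, 0.20],
          [0.15, 0.40, 0.05],
          [0.25, 0.10, 0.12]]
mono_a = [[1.00, 0.30, 0.45],
          [0.30, 1.00, 0.20],
          [0.45, 0.20, 1.00]]
mono_b = [[1.00, 0.20, 0.10],
          [0.20, 1.00, 0.35],
          [0.10, 0.35, 1.00]]

tallies = discriminant_tallies(hetero, mono_a, mono_b)
```

In this made-up example the third trait has a low validity and so satisfies few of the inequalities, the pattern the criteria are meant to expose.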

The published validities (Gough, 1957) of the California
Psychological Inventory (C.P.I.) are two-method convergent validities.
Typically, test scores are shown to correlate significantly with
nontest trait criteria such as ratings or life performances. Personality
inventory scores tend to intercorrelate, and the C.P.I. has been 
criticized on the basis of inadequate differentiation among traits 
(Thorndike, 1959). The present study examines data from three 
independent samples with respect to the convergent and discriminant 
validity of five C.P.I. variables. 


Method 


Samples. The samples, all male, are 66 student engineers, 70
medical school applicants, and 45 research scientists studied at the
Institute of Personality Assessment and Research, University of
California, Berkeley. These groups were selected from among others
assessed at the Institute because of the availability of identical sets 
of C.P.I. scores and relevant behavior rating criteria. 

Rating Scores. Rating scores for each of five traits are the aver- 
ages of ratings by independent observers who observed the subjects 
during a comprehensive, two-day, OSS-type assessment program. 
This program has been described in detail elsewhere (MacKinnon, 
Crutchfield, Barron, Block, Gough & Harris, 1958). The raters
observed group discussion and problem-solving sessions, laboratory
assessment procedures, informal social interactions, stress interviews,
etc., and recorded their judgments for each trait at the close of the 
two-day period. 

Seven raters observed the medical sample, ten raters the sci- 
entists and 40 of the engineering students, and fifteen raters the re- 
maining 26 engineering students. Trait definitions used by the raters 
follow. 

Dominance. Personal ascendance in relations with others (reso- 
lute, self-assured, forceful, not easily intimidated, authoritative). 

Responsibility. Willingness to accept the consequences of one's
own behavior; dependability, trustworthiness, sense of obligation to
the group. (This need not require the person to assume leadership or
direction of group activity).




Impulsivity. Inadequate control of impulse; lacking in self-disci- 
pline: self-centered, quick-tempered, and explosive. 

Intellectual competence. The capacity to think, to reason, to 
comprehend, and to know. 

Rigidity. Inflexibility of thought and manner. Stubborn, pedantic, 
unbending, firm. 

Ratings of the medical and engineering samples were made with 
reference to a five-point, quasi-normal distribution and averaged 
over raters for each subject. The scientists were ranked in subsam- 
ples of fifteen and an average rating score for each subject was 
derived from the ranks. 

Corrected split-half reliabilities (3 raters vs. 3 raters for the
medical sample and 5 raters vs. 5 raters for the scientists and 40 of
the student engineers) appear in parentheses in the right-hand half
of Table 1. The rating reliabilities and the published C.P.I.
reliabilities² which appear in parentheses in the left upper part of the
table are high enough to allow substantial convergencies between
test scores and rating scores. A possible exception is the Fx-Rigidity
relationship; even here the upper limit on validity set by
unreliability of the scores is r = .60.
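
The ceiling mentioned above follows from the classical attenuation result that an observed correlation cannot exceed the square root of the product of the two reliabilities. The sketch below is illustrative only: the function names are hypothetical, the input values are made up rather than taken from Table 1, and it assumes that the "corrected" split-half figures use the Spearman-Brown formula.

```python
import math


def spearman_brown(half_r):
    """Step a half-versus-half correlation up to full-length reliability."""
    return 2 * half_r / (1 + half_r)


def max_validity(rel_x, rel_y):
    """Upper limit on the correlation of two measures given their reliabilities."""
    return math.sqrt(rel_x * rel_y)


# Made-up values: raters split into halves correlate .50;
# the test's retest reliability is taken to be .54.
rating_reliability = spearman_brown(0.50)
ceiling = max_validity(rating_reliability, 0.54)
```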

The signs of the rating scores for Impulsivity and Rigidity were
reversed in computing the correlations so that the direction of the
ratings would correspond to the pertinent C.P.I. scores and so that 
all variables would be in a “positive” direction. Thus (—) Impul- 
sivity represents a “Self-Control” rating score and (—) Rigidity is 
a "Flexibility" rating score. It may be noted that C.P.I. Sc was 
originally an Impulsivity scale (Im) but was subsequently reflected 
by reversing the keying. 

C.P.I. Scores. These were selected on the basis of rating variables 
available. Each of the four general classes of traits the C.P.I. was 
designed to measure (Gough, 1957) is represented by at least one 
scale. Do (Dominance), Re (Responsibility), and Sc (Self Control)
were constructed and cross-validated empirically against peer and
superiors' trait ratings. Ie (Intellectual Efficiency) was empirically
constructed with intelligence test scores as a criterion, and Fx
(Flexibility) was rationally constructed. Convergent validities of these
scales in various samples are furnished in the Manual (Gough, 1957).


²Retest 7 to 21 days, 200 prison males (Gough, 1957).


Results 


Table 1 shows the heteromethod-monotrait convergent validity 
correlations (underlined), the heteromethod-heterotrait values 
(broken triangles), and the monomethod-heterotrait values (solid 
triangles) for each of the three samples. Values of r significant at 
the .05 level are shown with each matrix. The number of times in 
the twelve possible comparisons in which the validities meet the 
heteromethod discriminant criterion appears in the first row below 
the matrices. The same tally for the satisfaction of the monomethod 


TABLE 1

Three Convergent-Discriminant
Matrices for Five C.P.I. and Rating Variables

(Matrix entries and tallies are not legible in the source. Each matrix
crosses the C.P.I. scales with staff ratings of Dominance, Responsibility,
(−)Impulsivity, Intellectual Competence, and (−)Rigidity. Sample I: 66
student engineers. Sample II: 70 medical school applicants. Sample III:
45 research scientists.)

s: sign considered
a: sign not considered


discriminant criterion is shown in the second row below. The tallies
marked s consider the sign of the correlations; a tallies consider
absolute values without regard to sign. Table 2 shows the C.P.I.
means (in normative T score units) and standard deviations (raw
scores) of the three samples in comparison with Gough's (1957)
college males.

Convergent validation criterion. Although only one scale (Ie)
meets this criterion unfailingly, the trend of the convergent validities
for four of five traits indicates modest satisfaction of this criterion.
Weighted average validities (via z transformation) for the pooled
samples and their significance level based on 181 cases are: Do .43
(.01); Re .09 (n.s.); Sc .14 (.06); Ie .35 (.01); and Fx .18 (.05).
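The pooling step can be sketched as follows. This is a minimal illustration, assuming that each sample's correlation is converted to Fisher's z and weighted by n - 3; the text does not state the weights used. The per-sample correlations and the sample sizes below are hypothetical and were chosen only so that the sizes total 181.

```python
import math

def pooled_r(rs, ns):
    """Average correlations through Fisher's z, weighting each sample by n - 3.

    The n - 3 weights are an assumption; the article says only that a
    z transformation was used.
    """
    weights = [n - 3 for n in ns]
    z_bar = sum(w * math.atanh(r) for r, w in zip(rs, weights)) / sum(weights)
    return math.tanh(z_bar)  # back-transform the mean z to the r metric

# Hypothetical validities of one scale in three samples (not the study's data).
rs = [0.30, 0.45, 0.50]
ns = [66, 70, 45]  # assumed sizes, totalling 181 cases
r_bar = pooled_r(rs, ns)
print(round(r_bar, 2))
```

Averaging in the z metric rather than averaging the raw r values keeps large correlations from being underweighted, since the sampling distribution of r is skewed near its limits.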

Heteromethod discriminant criterion. The tally of correlation inequalities
indicates that Do and Ie generally satisfy this criterion. The
tallies for Sc and Fx are in a favorable direction but do not exceed
chance. The Re tally is meaningless because of lack of convergent
validity. Even in the case of the scales which best meet the heteromethod
discriminant criterion, the inequalities are often very small.

[...] in a factorial sense, or in terms of predictability of one scale
from another, high negative heterotrait values would seem as unfavorable
to discriminant validation as high positive values.


TABLE 2

Means* and Standard Deviations† of C.P.I. Variables

                                        Scale
Sample                        Do     Re     Sc     Ie     Fx
680 College Males    Mean     54     55     47     54     56
                     S.D.    6.0    4.7    7.1    5.2    3.8
Engineers            Mean     52     52     51     56     61
                     S.D.    5.8    4.2    6.5    3.6    2.8
Medical              Mean     60     56     50     60     59
                     S.D.    5.4    3.9    5.8    3.1    3.5
Scientists           Mean     60     56     52     60     61
                     S.D.    4.7    8.2    5.3    3.0    3.9

* T-score units.
† Raw score units.

Monomethod discriminant criterion. Only Ie unequivocally meets 
the more stringent discriminant criterion. Do tends to, as does Sc in 
the two samples where it is convergently valid. Fx clearly does not
meet this standard unless the signs of the monomethod correlations 
are considered, and Re of course fails. Again the favorable instances 
of correlation inequality are often small. 

Trait relationship consistency criterion. The intra-C.P.I. trait
relationship pattern (monomethod-heterotrait correlations) and the
intra-rating trait relationship pattern (monomethod-heterotrait correlations)
were correlated by pairing the corresponding r values
(the C.P.I. r for Do and Re paired with the rating r for Do and Re, etc.).
The rho values are .25 for the engineers (n.s.), .52 for the medical
applicants (n.s.), and .61 for the scientists (n.s.). Two of these values
are substantial even though the few degrees of freedom prevent
significance, and the fourth criterion may be given some support.
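The pairing procedure can be sketched as below. Five traits yield ten heterotrait pairs, so each rho rests on only ten paired values, which is why even a rho of .61 falls short of significance. The correlation values are hypothetical, and rho is computed here as the Pearson correlation of ranks, which is one standard definition and is assumed rather than taken from the article.

```python
from itertools import combinations

def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_rho(x, y):
    """Rank-order correlation: Pearson r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

traits = ["Do", "Re", "Sc", "Ie", "Fx"]
pairs = list(combinations(traits, 2))  # the ten heterotrait pairs

# Hypothetical monomethod-heterotrait correlations, one per trait pair.
cpi = [0.35, 0.20, 0.25, -0.10, 0.40, 0.15, -0.05, 0.30, 0.05, 0.10]
rating = [0.45, 0.10, 0.50, -0.20, 0.30, 0.25, 0.05, 0.40, -0.10, 0.20]
rho = spearman_rho(cpi, rating)
```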


Discussion 
Two C.P.I. variables meet the Campbell-Fiske criteria fairly
satisfactorily. Do and Ie are convergently valid and satisfy both
discriminant criteria in a majority of instances. The most valid scale,
Ie, differs logically from the others in that it pertains to a relatively
well-defined personality construct not interpersonal in character.
Both the test score and the ratings may have relatively high construct
validity, although the distinctiveness of the latter is not supported
by the monomethod discriminant rating tally.

Do is one of the most valid C.P.I. scales in Gough’s (1957) data,
and has been substantially validated in independent studies (Altrocchi,
1959; Smelser, 1961). Do is the most independent of this group
of C.P.I. scales in a factorial study (Mitchell & Pierce-Jones, 1960).
[...] only one factor on which none of the others load.
Despite some favorable trends, the other three C.P.I. variables
generally fail to satisfy the proposed criteria. Low or zero convergent
validities are the main difficulty. These negative findings contrast
with a considerable body of evidence on the relationship of several
C.P.I. variables to socially-defined behaviors. In addition to the Do
data, positive evidence on the external validity of the responsibility
and socialization scales (Dinitz, Kay & Reckless, 1957; Gough, 1960;
Shirasa & Azuma, 1961), achievement (r vs. nursing grades = .41,
unpublished University of California Nursing School Study), and
profile interpretations (Shapiro, 1957) may be cited.
Inadequacy of the criteria is one possible source of the present
low validities. Despite high reliabilities, homogeneity of the subjects
and the short observation period may have lessened the value of
some of the ratings. The discriminant tallies are generally less satisfactory
for the ratings than for the test scores, and the median rating
intercorrelation is greater (.37, sign not considered) than the comparable
value for the C.P.I. scores (.21). The sophistication of the
assessment procedure and of the raters as well as the high reliabilities
indicate, however, that the criteria may be about as satisfactory
as is possible in the case of behavior ratings.

Restricted variability and high average scores in the C.P.I. are
another possible hindrance to convergent validity in the present
samples (Table 2). In addition to statistical constraints on correlation
by restriction of range, the behavioral relevance of the test
scores may be limited in selected samples of this kind that frequently
score well above the population norm. It must be noted, however,
that high means and gross restriction of range are not always the
case, and that internal comparisons do not substantiate the hypothesis
of restricted behavioral relevance. There is no intra- or inter-sample
relationship between level of convergent validity and variability
or mean score.


The Role of Social Desirability 


The work of Edwards (1953; 1957), Jackson and Messick (1958;
1960), and others suggests that common social desirability variance
may be one source of lack of discrimination among self-report measures.
If such variance could be reduced, discriminant validity might
be improved. Scores on a 32-item C.P.I. social desirability scale
(SD) designed for another investigation were available for the
present samples. SD was based on item desirability judgments and
developed according to Hanley’s (1957) rationale for maximizing
individual differences in desirability bias. True and false keying of
desirable responses is balanced. Corrected split-half reliabilities of
SD are .62 (100 college males) and .67 (present medical sample).⁴
Standard deviations in the present samples are 4.0 (engineers), 3.9
(medical), and 3.1 (scientists).

The correlations of SD with the C.P.I. and rating scores are given
in parentheses in Table 3. The remaining entries in the table are
partial correlations ($r_{12.3}$, where 1 is a C.P.I. score, 2 is a C.P.I. or
criterion score, and 3 is SD). The C.P.I. scales relate positively to
SD, except for Fx, which is consistently negatively related.
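For reference, the first-order partial correlation is r12.3 = (r12 - r13 r23) / sqrt((1 - r13^2)(1 - r23^2)). A minimal sketch follows; the numerical inputs are hypothetical and are not taken from Table 3.

```python
import math

def partial_r(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2, holding 3 constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Hypothetical case 1: two scales that both correlate positively with SD.
# Removing SD lowers their intercorrelation.
shared = partial_r(0.40, 0.45, 0.35)

# Hypothetical case 2: a near-zero raw correlation between two scales that
# both correlate positively with SD. Removing SD exposes an inverse relation.
masked = partial_r(-0.05, 0.45, 0.40)
```

The second case corresponds to the kind of pattern in which common desirability variance masks an underlying inverse relationship between two scales.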

Adjusting the correlations for desirability does not substantially
affect convergent validity. Some of the discriminant relationships
appear to be clarified. Consistent with the hypothesis of intercorrelation
on the basis of common desirability variance, the largest positive
C.P.I. intercorrelations (e.g., Do-Re; Re-Sc) are reduced by
adjusting for correlation with SD. However, some of the intercorrelations
increase. For instance, Do and Sc appear to tap dimensions
for which an underlying inverse relationship is masked by
common desirability variance. Fx shifts toward positive relationships
with the other C.P.I. scales when its (negative) association with
desirability responding is controlled, indicating its unique component
may be more closely related to those of the other scales than
the uncorrected correlations indicate.

⁴ Hanley’s rationale anticipates moderate internal consistencies in populations
containing both desirability respondents and unbiased (“honest”)
subjects.


TABLE 3

Adjustment of C.P.I. Intercorrelations
and Validities for Social Desirability

[Table entries are not legible in this copy. For each sample the table
gives the partial correlations among the five C.P.I. scales (Do, Re, Sc,
Ie, Fx), the adjusted validities, the correlations of each C.P.I. scale and
each criterion with SD (in parentheses), and the monomethod
discriminant tallies.]

s: sign considered
a: sign not considered




Because of these increases, controlling desirability does not appreciably
improve over-all discriminant validity. The median absolute
C.P.I. intercorrelation is reduced from .21 to .17 (from .24 to .16
if Fx is excluded). The monomethod discriminant validity tallies are
essentially unaffected.


Summary and Conclusion 


Measurements of five personality dimensions by the California
Psychological Inventory and by composite ratings of behavior observers
were compared in regard to four criteria of convergent and
discriminant validation proposed by Campbell and Fiske. The analy- 
sis was replicated in three independent samples. Four of the five 
C.P.I. variables meet a criterion of convergent validity, two of these 
only minimally. Two C.P.I. variables satisfy criteria of discriminant 
validity, but the others fail in spite of some favorable trends. Low 
validity was discussed in terms of possible criterion inadequacies 
and restriction of the range of the test scores. Accounting for social 
desirability appeared to clarify the interrelationships of the test 
scores but did not improve over-all discrimination among them. 
The discriminant validity problem suggests that other personality
inventories should be studied in this respect as well as in regard to
convergent validity of individual scales. Evidence on the former is
seldom made available.


REFERENCES 


Altrocchi, J. C. “Dominance as a Factor in Interpersonal Choice and
Perception.” Journal of Abnormal Psychology, LVIII (1959).


Campbell, D. and Fiske, D. W. “Convergent and Discriminant Val- 
idation by the Multitrait-Multimethod Matrix.” Psychological 
Bulletin, LVI (1959), 81-105. 

Cronbach, L. Essentials of Psychological Testing. (Second Edition)
New York: Harper and Brothers, 1960.

Dinitz, S., Kay, B., and Reckless, W. “Delinquency Proneness and
School Achievement.” Educational Research Bulletin,
(1957), 191-196.

Edwards, A. “The Relationship between the Judged Desirability of
a Trait and the Probability that the Trait will be Endorsed.”
Journal of Applied Psychology, XXXVII (1953), 90-93.

Edwards, A. The Social Desirability Variable in Personality Assessment
and Research. New York: Dryden Press, 1957.

Gough, H. “Cross-Cultural Studies of the Socialization Continuum.”
American Psychologist, XV (1960), 410-411.




Gough, H. Manual for the California Psychological Inventory. Palo
Alto: Consulting Psychologists Press, 1957. 

Hanley, C. “Deriving a Measure of Test Taking Defensiveness.” 
Journal of Consulting Psychology, XXI (1957), 391-397. 

Jackson, D. and Messick, S. “Content and Style in Personality As- 
sessment.” Psychological Bulletin, LV (1958), 243-252. 

Jackson, D. and Messick, S. Acquiescence and Desirability as Re- 
sponse Determinants on the MMPI. Princeton, N. J.: Educa- 
tional Testing Service, 1960. 

MacKinnon, D., Crutchfield, R., Barron, F., Block, J., Gough, H., 
and Harris, R. “An Assessment Study of Air Force Officers.” 
Technical Report WADC-TR-58-91 (1) ASTIA Document No. 
AD 151040. Lackland Air Force Base, Texas, 1958. 

Mitchell, J. V., Jr. and Pierce-Jones, J. “A Factor Analysis of
Gough’s California Psychological Inventory.” Journal of Consulting
Psychology, XXIV (1960), 453-456.

Shapiro, A. “Consensus and Accuracy of Personality Appraisals 
Based on Test Protocols.” Unpublished Ph.D. thesis, University 
of California, Berkeley, 1957. 

Smelser, W. T. “Dominance as a Factor in Achievement and Percep- 
tion in Cooperative Problem-Solving Interactions.” Journal of 
Abnormal and Social Psychology, LXII (1961), 535-541. 

Thorndike, R. “California Psychological Inventory.” In O. K. Buros
(Editor), The Fifth Mental Measurements Yearbook. Highland 
Park, New Jersey: Gryphon Press, 1959. 




NOTE ON VOCABULARY TEST CONSTRUCTION 


EDWARD E. CURETON 
University of Tennessee 


We consider here mainly the construction of vocabulary tests
which are to be used as tests of verbal intelligence. We are concerned
only incidentally with the special problems which arise in
connection with the construction of vocabulary tests to serve other
purposes such as, e.g., the testing of the examinees’ precision of
knowledge of the meanings of particular words.

In testing verbal intelligence by means of vocabulary, we should
be interested mainly in knowledge of abstract words. Abstract words
(mainly adjectives, adverbs and verbs) usually have synonyms
(and/or antonyms); concrete words (mainly nouns) usually do
not. We therefore confine attention to those words which have synonyms
(and/or antonyms). This permits the use of compact item
structures: simple word-pairs for the same-opposite test (or for the
same-opposite-neither test), or a one-word stem followed by four
or five one-word alternatives for the more commonly used synonym
test.

There should be some slight gain in reliability if all words of a
given item are of about the same difficulty. Thus the whole item becomes
functional; the other words are not merely accessory apparatus
to permit the recording of knowledge or lack of knowledge
of the meaning of one word.

Experience indicates that as an index of verbal intelligence, general
knowledge of more difficult abstract words is somewhat better
than exact knowledge of simpler words. The same-opposite test
works well, and in the case of the four- or five-choice synonym test,
the distractors should not be analogs or “almost-synonyms” of the
stem word. The use of items of the latter type is fully justified in
the vocabulary sections of English tests, since it tends to raise their 
correlations with other indices of English achievement while reduc- 
ing their correlations with other indices of general verbal intelligence. 

With these considerations in mind, the primary source of material 
for items may well be Webster's Dictionary of Synonyms (1942). 
This dictionary gives not only synonyms, but also analogs (‘‘almost- 
synonyms”), antonyms, and contrasts (“almost-antonyms”). In con- 
structing a four- or five-choice synonym test, the stem and the an- 
swer should each be listed as a synonym of the other, and the author 
should read carefully the main description of the set of synonyms, 
which discusses the shades of difference between them, to be sure 
they are sufficiently alike. No one of the distractors should be listed 
as an analog of either the stem or the answer. 

In writing distractors, the author may use his ingenuity, but even 
here the dictionary may be of some help. Words alphabetically close 
to the stem or answer may sound like one or the other. If they are 
of the same class, their synonyms or analogs may turn out to be 
good distractors. 

In the four- or five-choice synonym test, all the distractors should 
be of the same class as the stem and answer. “Same class” cannot 
be rigidly defined, but it means more than simply the same part of 
speech. Thus if the stem and answer are names of qualities of a per- 
son, all the distractors should also name qualities of a person. 

A fair index of the difficulty of an abstract word which has not 
obviously come into more widespread use or acquired a special 
meaning in recent years is its frequency of occurrence as tabulated 
in Thorndike and Lorge, The Teachers Word Book of 30,000 Words 
(1944). Since the relation between frequency of occurrence and 
number of different words at each given frequency level is approxi- 
mately exponential (Zipf's law), the index of frequency should be 
proportional to the logarithm of the actual frequency per million 
running words. A conversion scale sufficient for the purposes of vocabulary
test construction is given in Table 1, and its derivation is
given in the Technical Appendix. 

TABLE 1

Logarithmic Scale Values (F) for Frequencies of Words
in the Teachers Word Book of 30,000 Words

Freq.      F        Freq/m      F        Freq/18m     F
AA        34        38-49      20        14-17        6
A         21        29-37      19        11-13        5
                    22-28      18        9-10         4
                    16-21      17        7-8          3
                    12-15      16        5-6          2
                    9-11       15        4            1
                    7-8        14
                    5-6        13        Not in
                    4          12        Book         0
                    3          11
                    2          10
                    1           8


For item analysis purposes, it is easy in the case of an experimental
vocabulary test to obtain large numbers of subjects. School
authorities are usually willing to cooperate, and, at the upper High
School level, students can quite readily complete a 150-item experimental
test in one hour. The instructions should emphasize the
point that every examinee is to attempt every item, and the examining
period should be long enough to permit compliance. The correlation
between speed and power is neither zero nor unity, and no formula
provides adequate corrections to the item indices of the later
items when some examinees do not finish the experimental test.

The writer favors the testing of about 1000 subjects, and the use of
precisely 926 as the original experimental group. Extra subjects are
discarded in such a manner as to make the score distribution
smoother and more nearly normal. With 926 cases, the high-low 27
per cent method is sufficiently accurate, and the high and low groups
each include exactly 250 subjects. The per cents correct in upper and
lower groups are then easily computed by multiplying each number
correct by 4, dropping the last digit, and rounding the next-to-last.

To find P, the estimated per cent correct in the whole group, and
r, the item-test correlation, it is convenient to use Fan's Item Analysis
Table (1952). For given values of $P_U$ and $P_L$, P and r can be read
directly, provided r is positive and P does not lie outside the range
.05 to .95, in either of which cases the item should be discarded. Fan
also gives a scaled index of difficulty $\Delta = 13 + 4z$, where z is the
normal deviate corresponding to P, with scale reversed, and $\Delta$ is
recorded to one decimal.

The writer prefers a slightly simpler scaled index. If we truncate
the normal distribution at P = .05 and .95, an index of difficulty

$D = 50 + 30z$,

rounded to the nearest integral digit, takes values from 1 to 99 as P
goes from .05 to .95. If the scale is not reversed, the D values,
though scaled, are numerically not very greatly different from the P
values, and are identical at P = 8, 48, 49, 50, 51, 52, and 92. Table
2 gives values of D for values of P from .05 to .95.
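The arithmetic in the preceding paragraphs can be checked with a short sketch. The function names are hypothetical, and the standard library's NormalDist stands in for a printed normal table. The scale is taken as not reversed, so that z is the deviate below which a proportion P of the normal distribution lies.

```python
from statistics import NormalDist

def group_size(n_total, fraction=0.27):
    """Size of each extreme group under the high-low 27 per cent method."""
    return round(n_total * fraction)

def percent_correct(n_correct, n_group=250):
    """Per cent correct in a group, rounded to the nearest whole per cent."""
    return round(100 * n_correct / n_group)

def shortcut_percent(n_correct):
    """Multiply by 4, then drop the last digit while rounding the next-to-last."""
    return round(n_correct * 4 / 10)

def difficulty_index(p):
    """D = 50 + 30z, rounded to an integer, for .05 <= p <= .95."""
    if not 0.05 <= p <= 0.95:
        raise ValueError("items outside .05 to .95 are discarded")
    z = NormalDist().inv_cdf(p)
    return round(50 + 30 * z)

# Per cents (5 through 95) at which D equals the per cent correct itself.
same_as_p = [p for p in range(5, 96) if difficulty_index(p / 100) == p]
print(group_size(926), difficulty_index(0.05), difficulty_index(0.95), same_as_p)
```

With groups of exactly 250, the multiply-by-4 shortcut agrees with the direct percentage for every possible number correct, which is the practical reason for fixing the experimental group at 926.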
As an index of discrimination, the writer prefers an index proportional
to the Fisher Z-transformation to the r-value itself, even
though for the widespread-tails tetrachoric r it is neither a variance-stabilizing
transformation nor a transformation normalizing the
sampling distribution. At least, however, it removes the upper limit
+1, and it is both partially variance-stabilizing and partially normalizing.
For use as an index of discrimination, we need merely
round the value of Z to two decimals and remove the decimal point.
The resulting values do not reach 100 until r is greater than about .76,
and item-test correlations seldom exceed this value.

An experimental five-choice synonym test of 150 items was constructed
in the manner described above, and administered to 1000-odd
senior high school students in Oak Ridge, Tennessee. As frequency
estimates of the item difficulties, the F-values for the stem
and answer of each item were summed. For the 150 items, the correlation
between the summed F-values and the D-values computed
from upper and lower groups of 250 was .78. It appears, therefore,
that estimates of difficulty based on frequency are reasonably good.
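The discrimination index described above amounts to 100 times Fisher's Z, rounded to an integer. A minimal sketch, with a hypothetical function name, shows where the index reaches 100:

```python
import math

def discrimination_index(r):
    """Fisher Z of r, rounded to two decimals with the decimal point removed."""
    return round(100 * math.atanh(r))

for r in (0.30, 0.50, 0.75, 0.76):
    print(r, discrimination_index(r))
```

Because Z rounds to 1.00 once it reaches .995, the index first reaches 100 at r = tanh(.995), or just under .76.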


TABLE 2

Index of Difficulty (D) from Per Cent Correct
D = 30z + 50, Where z = Normal Deviate Corresponding to Per Cent Correct

P     .00  .01  .02  .03  .04  .05  .06  .07  .08  .09
.0                               1    3    6    8   10
.1     12   13   15   16   18   19   20   21   23   24
.2     25   26   27   28   29   30   31   32   33   33
.3     34   35   36   37   38   38   39   40   41   42
.4     42   43   44   45   45   46   47   48   48   49
.5     50   51   52   52   53   54   55   55   56   57
.6     58   58   59   60   61   62   62   63   64   65
.7     66   67   67   68   69   70   71   72   73   74
.8     75   76   77   79   80   81   82   84   85   87
.9     88   90   92   94   97   99


REFERENCES 


Fan, Chung-Teh. Item Analysis Table. Princeton, N. J.: Educa-
tional Testing Service, 1952. 

Thorndike, E. L. and Lorge, I. The Teachers Word Book of 30,000 
Words. New York: Teachers College, Columbia University, 1944. 

Webster's Dictionary of Synonyms. Springfield, Mass.: G. & C. Mer- 
riam Company, 1942. 


TECHNICAL APPENDIX 


The main entries in The Teachers Word Book of 30,000 Words
are based on the number of occurrences of each word in a count of 
18 million running words. All words that occurred as often as 4 times 
in 18 million are included. The exact number of occurrences is given 
for words occurring 4 to 17 times in auxiliary lists at the end of the 
book. The main list, including about 19,500 words, lists frequency 
per million for all but the commonest 1000-odd. The index 1 means 
at least one per million but less than two, 2 means at least two per 
million but less than three, . . . , 49 means at least 49 per million but 
less than 50. The letter A designates words which occur at least 50 
times per million but less than 100, and the letters AA designate 
words which occur 100 or more times per million. 

Frequencies per 18 million were first converted to 3-decimal frac- 
tions of one million. In the main list the index 1 was taken as 1.5, 
2 as 2.5,..., 49 as 49.5; ie. the frequency of each group was 
represented by its mid-point. Logarithms of the exact frequencies 
and group mid-points were recorded to three decimals, and unity was 
added to each logarithm to avoid negative values for the words
whose frequencies per million were fractional.

Logarithms +1 were also recorded for the frequencies 3.5/18m
= .194/m, and for 50/m, the lower and upper boundaries exclusive
of the A and AA words. These logarithms +1 were .288 and 2.699, respectively,
with logarithmic range 2.411. This range was divided
arbitrarily by 20, yielding the increment .12055. Then starting with
.288, the increment .12055 was added 20 times, giving the boundaries
for a 20-group scale.

The boundaries of the A group are 50 and 100, with logarithms +1
equal to 2.699 and 3.000. Continuing the addition of .12055, there
are two group boundaries, 2.820 and 2.940, in this region. The midpoint
on the logarithmic scale is the average of 2.699 and 3.000, or
2.8495. This is closer to 2.820 than to 2.940, so group A was given
the scale value 21.

We do not know precisely the upper boundary of the AA group, 
but it can be estimated roughly. In the Lorge magazine count of 
approximately 5 million words [“nearly a million words from each 
of the following: (5 magazines)"], the word “the”—the commonest 
word in the English language—occurred 236,472 times. If we divide 
by 5 we obtain the value 47,294 per million. If we take the “nearly” 
literally, as meaning somewhat less than 5 million, we can use the 
round figure 50,000 per million; this would be estimating the total 
number of words counted at 4,729,400. Since the magazine count is 
not wholly representative of the full count of 18 million, the esti- 
mate 50,000 for the upper boundary is probably about the best we 
can do. 

For the AA group we then have boundaries 100 per million and 
50,000 per million, with logarithm +1 values 3.000 and 5.699. For
the midpoint, the logarithm +1 value is 4.3495. Continuing to add
.12055, we find that the scale value for group AA, i.e., the scale value
for 4.3495, is 34. If it were actually broken down into smaller groups, 
there would be a total of 23 such groups in the AA range. The word 
“the” would have scale value approximately 46. 

There is one other anomaly. Two group boundaries occur between 
the logarithm +1 values for 17/18m and 1.5/m, and two occur be- 
tween the values for 1.5/m and 2.5/m. Hence there are no frequency 
groups corresponding to scale values 7 and 9 in Table 1. 
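The derivation above can be sketched as follows. The sketch assumes that a word's scale value is the number of the .12055-wide interval, counted upward from .288, into which its logarithm +1 falls. That rule is inferred from Table 1 rather than stated in the text, and the function name is hypothetical.

```python
import math

LOWER = 0.288   # logarithm +1 of 3.5/18 per million, as recorded in the text
STEP = 0.12055  # (2.699 - .288) / 20

def scale_value(freq_per_million):
    """Scale value F for a frequency expressed per million running words."""
    x = math.log10(freq_per_million) + 1
    return math.ceil((x - LOWER) / STEP)

# Auxiliary lists: exact counts of 4 to 17 occurrences in 18 million words.
aux = {count: scale_value(count / 18) for count in range(4, 18)}

# Main list: index k stands for the group mid-point k + .5 per million.
main = {k: scale_value(k + 0.5) for k in range(1, 50)}

# Group AA: mid-point 4.3495 on the logarithm +1 scale.
aa = scale_value(10 ** (4.3495 - 1))

# Group A is assigned 21 in the text by a nearest-boundary argument,
# so it is not computed with this rule.
used = sorted(set(aux.values()) | set(main.values()))
print(used, aa)
```

Under this rule the auxiliary counts produce the values 1 through 6 and the main list produces 8 and 10 through 20, so 7 and 9 never occur, in agreement with the anomaly noted above.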

In constructing vocabulary tests, we try to keep the scale value
of the answer within 5 or 6 points of that of the stem, and the scale
values of the distractors within 9 or 10 of that of the stem. Because
of the great range of group AA, it would seem to follow that if the
stem is an AA word, the answer and all distractors should also be
AA words.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 3, 1963 


A MODEL OF ITEM AMBIGUITY IN PERSONALITY 
ASSESSMENT! 


LEWIS R. GOLDBERG 
University of Oregon 


and 
Oregon Research Institute 


ONE of the most common criticisms heard from subjects upon 
taking almost any popular personality inventory is that "so many 
of the items seem ambiguous." Further questioning will often elicit 
statements such as “Many of the items I'd have answered one way 
today and another way tomorrow” or “The items seemed so ambigu- 
ous that I was never sure how to respond.” 

That subjects may become annoyed with the form, content, or 
response structure of some particular test items would not in itself
be cause for grave concern, if the instruments which utilized these 
items yielded highly satisfactory test scores. However, few psycholo- 
gists would question the need for improving present-day personality 
tests, and one might well begin by investigating the apparent diffi- 
culties encountered by subjects in responding to individual test 
items. 

In any study of item characteristics, one cannot help but be
impressed by the marked inconsistency of response elicited by most
personality test items. Evidence of such response instability has
been reported in the psychometric literature since the early 1930's.
Table 1 lists some representative figures from this literature showing
the percentage of items changed on retest by the average subject
for some typical early personality inventories.

1 The author gratefully acknowledges the stimulation and help provided by
Paul J. Hoffman, Rolfe LaForge, and others at the Oregon Research Institute.
This paper has profited from the critical reading of earlier drafts by Hoffman
and LaForge as well as by Donald Fiske, Leonard Rorer, Jerry S. Wiggins
and Nancy Wiggins. This project has been supported by two grants from the
Graduate School of the University of Oregon, and Grant G-25123 from the
National Science Foundation.

Table 2 lists the percentage of items changed by the average
subject for six more recently developed personality and "response
set" measures (Goldberg, Dufort & Hammersley, in preparation).
Approximately 300 college undergraduates were administered a
one-hour battery of tests, and 10 random samples from this population
were retested over 10 different test-retest intervals. Table 2
lists the average percentage of items changed for each of the six
measures for a three-week test-retest interval. Since the amount
of item change is obviously related to the number of response options
considered (Fiske, 1957), the data are analyzed for different

TABLE 1
Percentage of Items Changed by the Average Subject under Test-Retest Conditions:
Some Representative Figures from the Psychometric Literature

                                                            Number of
                                                            Response
                                            Test-Retest     Options       Average %
Reference          Personality Test         Interval        Considered    Change
Lentz (1934)       Bernreuter Person-       1 month         3             20
                   ality Inventory
Neprash (1936)     Thurstone Person-        2 weeks         2             14
                   ality Schedule           4 weeks         2             [illegible]
                                            8 weeks         2             [illegible]
Benton and         Landis and Zubin         immediate       3             [illegible]
Stone (1937)       Personal Inquiry         1 day           3             [illegible]
                   Form                     3 days          3             16
                                            4 days          3             19
                                            5 days          3             19
                                            7 days          3             19
                                            8 days          3             [illegible]
                                            21 days         3             [illegible]
Farnsworth         Bernreuter Person-       1 year          3             [illegible]
(1938)             ality Inventory          2 years         3             35
                                            3 years         3             35
Eisenberg and      Thurstone Neurotic       23-28 days      3             15
Wesman (1941)      Inventory
Glaser (1949)      California Test of       1 month         2             [illegible]
                   Personality
Chance (1955)      Bell Adjustment          [illegible]     3             [illegible]
                   Inventory
Strong (1962)      Strong Vocational        3 days          3             [illegible]
                   [...]



TABLE 2

Percentage of Items Changed upon Retest by the Average Subject
as a Function of the Number of Response Categories

                          Number
                          of          Percentage Change when Items are Scored as:
                          Response                                   Absolute %
Scale                     Options     Dichotomous    Trichotomous    Change
MMPI: Factor A            2           14             —               14
MMPI: Factor R            2           14             —               14
Berg Perceptual
  Reaction Test           4           30             31              52
Rust and Davies
  Reported Behavior
  Inventory               3           9              16              16
Bass Social
  Acquiescence Scale      3           22             36              36
Couch and Keniston
  Agreement Response
  Set Scale               7           24             34              59


numbers of response categories, thereby permitting the direct com- 
parison across measures utilizing different response formats. 
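The effect of rescoring on apparent item change can be sketched as follows. The responses and the rule for collapsing four options into two are hypothetical; the text does not say how the response categories were combined.

```python
def percent_changed(test, retest, collapse=None):
    """Percentage of items answered differently on retest.

    collapse, if given, maps each raw response option to a coarser
    category before the two administrations are compared.
    """
    if len(test) != len(retest):
        raise ValueError("test and retest must cover the same items")
    if collapse is not None:
        test = [collapse[r] for r in test]
        retest = [collapse[r] for r in retest]
    changed = sum(a != b for a, b in zip(test, retest))
    return 100.0 * changed / len(test)

# Hypothetical responses of one subject to six four-option items.
first = [1, 2, 3, 4, 2, 3]
second = [2, 2, 4, 4, 3, 1]
dichotomy = {1: 0, 2: 0, 3: 1, 4: 1}  # assumed split at the scale mid-point

absolute = percent_changed(first, second)
dichotomous = percent_changed(first, second, dichotomy)
print(round(absolute, 1), round(dichotomous, 1))
```

A shift between adjacent options on the same side of the split counts as a change under absolute scoring but not under dichotomous scoring, so coarser scoring can only lower the percentage. This is why measures with different response formats are compared at a common number of categories.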
Although the findings presented in Tables 1 and 2 indicate a
marked degree of response instability on items from both older and
more recent personality inventories, an even more significant indication
of intra-individual variability becomes evident when one
examines the test-retest reliability coefficients for modern personality
scales. The majority of these average no more than .70 (for
example, Jackson, 1961), indicating that roughly one-third of the
total scale variance is error of measurement, if we make the assumption
that the personality trait to be measured is relatively invariant.
That is, although psychologists might well expect some changes in 
responses from one situation to another different one, most psy- 
chologists concerned with human measurement expect some stability 
of responses over time when the testing conditions appear to be 
reasonably constant and the traits to be measured are assumed to 
be relatively enduring ones. In this paper, it is assumed that some 
important personality traits are stable over short periods of time, 
and the discussion will be limited to the psychometric measurement 
of such stable traits. One possible way to reduce the error now 
associated with the measurement of these traits is to construct more 
stable items to use as initial item pools for later scale development.


470 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


Parameters of Item Stability 


Common sense considerations, augmented by the reflections of 
tested subjects, would suggest that a crucial component of item 
stability is certainly item ambiguity. Ambiguity has been tradi- 
tionally defined as doubtfulness or uncertainty in the meaning of a 
stimulus; it often denotes a stimulus whose meaning is open to 
various interpretations. Ambiguity can be measured subjectively by 
rating methods or objectively as some function of either of two 
indices: (a) inter-individual variability in the meaning of a stimu- 
lus (e.g., Broen, 1960) or (b) intra-individual variability in mean- 
ing over repeated administrations of the stimulus (i.e., interpreta- 
tive instability). It is apparent that indices utilizing either of these 
two measures will usually be highly correlated, for a stimulus 
which elicits disparate meanings among a group of persons will
typically elicit unstable meanings for an individual over time. The 
relationship between inter-individual and intra-individual response 
variability in personality assessment has been documented by Mitra 
and Fiske (1956) and Fiske (1957). The relationship between intra- 
individual variability and ambiguity has such strong rational ap- 
peal that investigators have been led to use intra-individual re- 
sponse variability (item instability) as a direct measure of item 
ambiguity. 

However, there is a serious error in such a practice. It has been 
demonstrated repeatedly that items of extreme endorsement fre- 
quencies (for example, items to which 90-100 per cent of a popu- 
lation choose the same alternative) tend to be responded to more 
consistently (i.e. with a smaller percentage of subjects who change 
their responses) over repeated administrations of the item than 
more balanced items (i.e., items to which a population’s choices ap-
proach a 50-50 split between alternatives). In other words, items
at the extremes of any attribute continuum have been shown to
elicit less change in responses than those reflecting positions in the 
middle range of the attribute (Frank, 1936; Hertzman & Gould, 
1939; Eisenberg & Wesman, 1941; Mitra & Fiske, 1956; Crockett,
Bates & Caylor, 1958). This finding, of the relative stability of
extreme items, has occurred in virtually all studies, regardless of 
the assessment instrument utilized. It seems directly analogous to 
findings from the area of ability testing, where items at the average 




difficulty level for a population tend to be less stable over retesting 
than items of extreme ease or difficulty. Figure 1 is an example of 
the typical curvilinear relation found between stability and endorse- 
ment frequency.

Since items seen by subjects as highly “ambiguous” can dichoto- 
mize the attribute continuum at many points, it is desirable to 
eliminate the effect of differences in endorsement frequency from 
an index of item ambiguity. That is, the practice of utilizing the 
percentage of subjects changing their responses to the item as an 
index of item ambiguity leads one to the position that items 


Figure 1. Endorsement frequency (horizontal axis) vs. percentage of
subjects changing responses on second administration of 160 adjectives
(N = 96 females).


—_— 

The relationship illustrated in Figure 1 does not result solely from that 
existing between the standard error of a proportion and the magnitude of 
the proportion (in this case, the item’s endorsement frequency). Changes in 
endorsement frequency due to sampling errors provide a lower bound for the 
degree of instability of any item, but most present-day personality test items 
are so unstable that this lower bound does not serve as a useful approximation
to the item's actual instability; that is, the correlation between the mean
shift in endorsement frequency over two administrations of an item and the
percentage of subjects changing their responses to the item is virtually zero
(for example, r=.13 [and .09] for 95 male [and 108 female] college students 
administered the MMPI and then retested after four weeks). 




dichotomizing an attribute continuum at some extreme point are
all less ambiguous than items which more evenly dichotomize the
continuum. An alternative proposal is to formulate an index of item
ambiguity which is based upon the percentage of response change
but which takes into account the effects of item imbalance.


The Response to a Monotone Item 


To more fully understand how such an index might be con-
structed, let us consider the response of a subject to a test item.
It is important to distinguish between two major forms of items,
called “monotone” and “non-monotone” items by Coombs (1952)
and Green (1954). The defining characteristic of a “monotone”
item is that as the magnitude of the attribute measured by the
item increases, the probability of any single response category
(“True,” “False”; “Yes,” “No”; etc.) increases monotonically with
it. That is, the item dichotomizes the attribute continuum such that
there is only one boundary between “Yes” and “No” regions. The
item “I am tall” is an example of a monotone item; as height
increases in the sample, the probability of a “True” response in-
creases monotonically with it. Now consider a “non-monotone”
item, for example, “I am of medium height.” In this case, as in
all non-monotone items, the probability of a particular response is
not monotonically related to the underlying attribute continuum.
The item divides the continuum into three or more regions; at
least two boundaries are required to separate “No” from “Yes”
regions.

Although some factual items, as well as many attitude items, are
non-monotonic, the majority of our present-day personality test
items appear to be monotones in form. Take the hypothetical item
“I am a shy person,” or, as a similar item, the adjective “Shy” ad-
ministered in an adjective check list. For persons who see them-
selves as forceful and extroverted, these items are easily checked
“False.” As progressively more retiring sorts of persons are selected,
one begins to find subjects hesitating in responding to the item, as if
they were asking themselves, “Just how shy does a person have to
be before he considers himself ‘shy’?” Eventually one finds sub-
jects responding “True” to this item, and as progressively more
shy individuals are selected, one would probably find that the ease
with which they answer the item in the “True” direction increases.

Being a bit more precise, for an hypothetically “honest” indi- 
vidual the decision to respond “True” or “False” to a dichotomously- 
scored monotone item is assumed to be a joint function of: (a) the 
perceived boundary established by the item on the attribute con- 
tinuum, and (b) the individual’s perceived position on this con- 
tinuum. If the item is perceived as having a boundary on one side 
of the individual’s own position, one alternative is checked; if the 
item is seen as having a boundary on the other side of the indi- 
vidual’s position, the other alternative is checked. 

Figure 2 is an attempt to graph an hypothetical frequency distri- 
bution of subjects on the attribute continuum of “perceived shy- 
ness” and to indicate where the item “Shy” in an adjective check 


Figure 2. Frequency density and attribute intensity. (Hypothesized
frequency distribution of subjects over the hypothesized attribute
continuum, running from “extremely outgoing” to “very shy,” with the
item “Shy” marked.)


— 

As Edwards (1957) and his associates have frequently warned, persons may
differ in their tendencies to admit socially undesirable aspects of themselves;
consequently, an “hypothetically honest” individual may be rare. Social
desirability considerations, however, can probably best be conceptualized as
influencing a subject’s perception of his own position on the attribute con-
tinuum (and/or the degree to which he is willing to report an “honest”
appraisal of this position) and thus may leave the essential components of
this more simple model unchanged.




list might fall on this underlying continuum. Note that 70 per cent 
of the population sees this item as establishing a boundary above 
the point on the attribute continuum where they see themselves 
and thus respond “False” to the item; the remaining 30 per cent
perceives the item as establishing a boundary below the point where
they see themselves and thus respond “True.” For those individuals
who see themselves far from the boundary, B (i.e., in regions A or C
of this distribution), the decision in regard to this item would be an
easy one to make, since the item's boundary should be perceived as
quite distant from the individual's position. Individuals close to B,
however, should face more difficulty since the boundary lies close
to their own position on the attribute continuum.

Those individuals who encounter the most difficulty in making a
decision upon the first presentation of an item should be the persons
most likely to change their response to the item upon its subsequent
presentation. Assuming the usual normal frequency distribution of
subjects on most attribute continua, items whose boundaries are
seen by the great majority of a population as quite distant from
their own (i.e., extreme items) will be responded to without diffi-
culty for most of the population upon the first presentation of the
item and will be changed only by some members of that very small
subsection of the population which is itself very extreme on the
attribute continuum. On the other hand, items with boundaries fall-
ing near the middle of the attribute continuum (i.e., balanced items)
will find many individuals having difficulty in responding to them
and many more individuals changing their responses upon retesting.


A Model of Item Ambiguity 


Obviously the boundary established by an item is not perceived
at the same point on the attribute continuum by everyone in the
population. More realistically the boundary of an item can be con-
ceived as occupying a “band” on an attribute continuum correspond-
ing to its perceived position by different members of a population.
One could conceptualize the “equivocality-band” of an item as the
range of disagreement in regard to the item’s boundary on the
attribute continuum. If every individual in the population sees the
item’s boundary as falling at the same position, the item could be
said to have a minimum equivocality band; if some individuals
perceive the item's boundary as being extreme in one direction while




other individuals perceive it as extreme in the other direction, the 
item could be said to have a maximum equivocality band. 

It is assumed that there is a strong relationship between the 
degree to which different persons agree in their positioning of the 
boundary of an item on an attribute continuum (upon the first ad- 
ministration of the item) and the extent to which individuals are 
consistent in positioning the boundary upon successive presentations 
of the item. That is, it is assumed that items of broad equivocality 
bands (as defined by different members of the population on the 
first administration) will be items of great intra-individual varia- 
bility in their positioning over repeated administrations of the items 
and can therefore be considered as having wide “ambiguity bands.” 

Previously, it has been hypothesized that the closer an individual 
perceives his position⁴ on an attribute continuum to be to the
boundary of an item, the more difficulty he will have in responding 
to the item upon its first administration and the more likely will he 
be to change his response upon its readministration. In addition, it 
seems reasonable to assume that the wider the ambiguity band of 
an item (i.e., the greater the intra-individual variability in bound- 
ary positioning) the less stable should be the item. Conversely, the 
narrower the ambiguity band of the item, the more stable it 
should be. 

In summary, then, the stability of any item (defined in the tradi- 
tional way as the percentage of individuals who give a consistent
response to the item over repeated administrations) depends upon
(a) the narrowness of the ambiguity band of the item (i.e., how
specifically localized is the perceived position of the item's boundary
on the attribute continuum over time by the average individual)
and (b) the frequency density at the item's boundary point. That is,
there should exist two kinds of “stable” and two kinds of “unstable”
items: an item can be stable because it is highly specific and pre-
cisely localized (narrow ambiguity band) and/or an item can be
stable because it is extreme and few subjects are found near the


⁴ Obviously, one could hypothesize an “ambiguity band” for an individual's
perceived position on an attribute continuum, and response stability could
be seen as a joint function of two interacting ambiguity bands. Moreover,
items and persons might also be conceptualized as points in multi-dimen-
sional space, rather than as points on a single underlying attribute con-
tinuum. The rationales for these more complex models have not yet been
completed.




item's boundary. Conversely, an item may be unstable because it
has a broad ambiguity band (i.e., it is perceived as being in different
spots on the trait continuum by the average person on different
occasions) and/or because it lies in the middle of the attribute
continuum (where there are many individuals clustered).

Figure 3 is an attempt to depict these notions graphically for
three items, “U,” “V,” and “W,” located on the same attribute con-
tinuum. Note that each item is conceived as having an ambiguity
band, corresponding to its perceived position on the attribute con-
tinuum by the average person on different occasions. Item U is
graphed as having a narrow ambiguity band; that is, it is perceived
on most occasions as being in about the same position on the attri-
bute continuum. Item W is graphed as having a much broader am-
biguity band and as being a more extreme item. Item V is graphed
as also being relatively extreme but having an ambiguity band
similar to item U. Note that most persons would respond “True”
to item V and “False” to item W, and that they could make these
choices rather easily.

Figure 3. Item ambiguity bands and item endorsement frequency.
(Hypothesized frequency distribution of subjects over the hypothesized
attribute continuum, with the items marked.)

The shaded areas above each item represent the proportion of
individuals who would have difficulty making a decision for the
particular item. These are the persons most expected to change
their responses upon repeated administrations of the item. As can
be seen from Figure 3, although item W has a wide ambiguity band, 
owing to its extreme position on the attribute continuum it elicits 
about the same percentage change as item U. Item V, with the 
same width ambiguity band as item U, elicits the least percentage 
change. 
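
The comparison among items U, V, and W can be sketched numerically. The following is only an illustration under stated assumptions: subjects are taken to be unit-normally distributed, the share of subjects expected to change is equated with the area of the curve lying inside the ambiguity band, and the boundary positions and band widths are hypothetical values chosen to resemble Figure 3, not values reported in the paper.

```python
from statistics import NormalDist

UNIT_NORMAL = NormalDist()  # assumed distribution of subjects on the continuum


def share_in_band(boundary: float, width: float) -> float:
    """Area of the unit normal curve lying inside an item's ambiguity band."""
    lower = boundary - width / 2
    upper = boundary + width / 2
    return UNIT_NORMAL.cdf(upper) - UNIT_NORMAL.cdf(lower)


# Hypothetical (boundary position, band width) pairs, chosen for illustration.
items = {
    "U": (0.0, 0.4),   # balanced item, narrow band
    "V": (-1.5, 0.4),  # extreme item, same narrow band as U
    "W": (1.5, 1.2),   # extreme item, much broader band
}

shares = {name: share_in_band(b, w) for name, (b, w) in items.items()}
for name, share in shares.items():
    print(f"item {name}: expected share changing = {share:.3f}")
```

With these assumed values, item V yields the smallest share, while items U and W yield similar shares despite their very different band widths, which is the pattern the figure is meant to convey.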

In actual practice, the percentage of individuals who change their 
responses to a given item can be readily determined by test-retest 
procedures, as can the item’s endorsement frequency. The remaining 
problem is to ascertain the width of the item’s ambiguity band (a 
distance along the assumed attribute continuum), given these two 
indices. 


Derivation of an Index of Ambiguity 


The preceding discussion provides the rationale behind the index 
of item ambiguity. For every two administrations of any item pool,
two indices are immediately available for each item: (a) a measure
of item endorsement frequency (actually two such measures are
available, one for each administration of the item; the average
endorsement frequency over both administrations of the item is the
index presently utilized⁵), and (b) a measure of item instability
(defined as the percentage of subjects in the population who change 
their responses upon a second presentation of the item). The prob- 
lem is to derive an index of ambiguity, based upon item instability, 
which corrects for item endorsement frequency. Only the monotonic 
case will be treated here. 
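
As a minimal sketch of how the two indices might be obtained from test-retest data, the function below takes one item's dichotomous responses on two administrations. The function name, the representation of responses as booleans, and the ten sample responses are assumptions made for illustration; only the definitions (arithmetic mean of the two endorsement frequencies, proportion of subjects changing) come from the text.

```python
from typing import Sequence, Tuple


def item_indices(first: Sequence[bool], second: Sequence[bool]) -> Tuple[float, float]:
    """Return (endorsement_frequency, instability) for one dichotomous item.

    first, second: each subject's response (True = endorsed) on the first and
    second administration, in the same subject order.
    """
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("both administrations need the same non-zero number of subjects")
    n = len(first)
    # Endorsement frequency on each administration, then their arithmetic mean.
    e_first = sum(first) / n
    e_second = sum(second) / n
    endorsement = (e_first + e_second) / 2
    # Instability: proportion of subjects whose response differs on retest.
    instability = sum(a != b for a, b in zip(first, second)) / n
    return endorsement, instability


# Hypothetical responses of ten subjects to a single adjective.
test1 = [True, True, True, False, False, False, False, False, False, False]
test2 = [True, True, False, True, False, False, False, False, False, False]
E, I = item_indices(test1, test2)
print(f"endorsement frequency = {E:.2f}, instability = {I:.2f}")
```

Note that in this made-up example the endorsement frequency is identical on both occasions even though two subjects changed their responses, which is why instability has to be counted subject by subject rather than inferred from the shift in endorsement.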

Figure 4 graphs the geometric rationale behind this ambiguity 
index, called, for short, Ambdex.


For Figure 4:
E = Endorsement frequency (shaded area above)
A = Ambdex (index of ambiguity; width of ambiguity band)

$X_E$ = Scale value ($E$th percentile) on the attribute continuum
which cuts from the normal curve an area equal to the
average endorsement frequency of the item

$f(X_E)$ = Ordinate of the normal curve erected at $X_E$


_ 
⁵ The geometric mean is being considered as an alternative to the arithmetic
mean utilized in the preliminary index.




Figure 4. A model for the index of ambiguity (Ambdex). (Hypothesized
frequency distribution of subjects over the hypothesized attribute
continuum.)


$X_1$, $X_2$ = Scale values on the attribute continuum such that

$X_1 < X_E < X_2$
$\beta = A/2 = X_E - X_1 = X_2 - X_E$
I = Average percentage of change (index of instability),
taken as equal to the area under the normal curve
between $X_E - \beta$ and $X_E + \beta$


The index of instability (I) for any item is represented by that
portion of the normal curve between $X_E - \beta$ and $X_E + \beta$ and as
here defined is equal to the percentage of persons changing their
responses to the item upon its second presentation. $X_E$ and $f(X_E)$
can be readily found from normal curve tables entered for the area
cut off the normal curve equal to the item’s endorsement frequency,
averaged over both administrations of the item. The problem is
to solve for the distance, A, between points $X_1$ and $X_2$, each equi-
distant from $X_E$.

The solution involves integrating the equation for the normal
curve between $-\infty$ and $X_E + \beta$ and subtracting the integral between
$-\infty$ and $X_E - \beta$, setting this value equal to I and solving for $\beta$.
Although no formal solution exists, $\beta$ can be readily found by inter-
polation from the table of values of the unit normal curve.




In practice, a simpler approximation is merely to define the area
of the rectangle:

$$I = A \cdot f(X_E)$$

as equivalent to the associated area under the normal curve. Then

$$A = \frac{I}{f(X_E)}.$$

This approximation will slightly underestimate Ambdex when the 
endorsement frequency is 50 per cent, and it will slightly overestimate 
Ambdex for extreme endorsement frequencies. It will be most 
accurate when $X_E$ falls directly under the curve’s inflection points.
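
Both the rectangle approximation and the exact solution for the band width can be sketched with the standard library's `statistics.NormalDist`. This is an illustrative sketch, not the author's procedure: the function names are hypothetical, bisection stands in for interpolation from normal curve tables, and clamping the endorsement frequency to the range .01 to .99 is an assumption made here so that the boundary stays finite for items endorsed by nearly everyone or nearly no one.

```python
from statistics import NormalDist

UNIT_NORMAL = NormalDist()  # mean 0, standard deviation 1


def boundary(endorsement: float) -> float:
    """Scale value X_E cutting off an area equal to the endorsement frequency."""
    # Assumption: clamp to [.01, .99] so the inverse normal stays finite.
    p = min(max(endorsement, 0.01), 0.99)
    return UNIT_NORMAL.inv_cdf(p)


def ambdex_approx(endorsement: float, instability: float) -> float:
    """Rectangle approximation: A = I / f(X_E)."""
    return instability / UNIT_NORMAL.pdf(boundary(endorsement))


def ambdex_exact(endorsement: float, instability: float, tol: float = 1e-10) -> float:
    """Solve Phi(X_E + b) - Phi(X_E - b) = I for b by bisection; return A = 2b."""
    x_e = boundary(endorsement)
    lo, hi = 0.0, 10.0  # the enclosed area grows monotonically with b
    while hi - lo > tol:
        mid = (lo + hi) / 2
        area = UNIT_NORMAL.cdf(x_e + mid) - UNIT_NORMAL.cdf(x_e - mid)
        if area < instability:
            lo = mid
        else:
            hi = mid
    return lo + hi  # twice the midpoint of the final bracket


for e in (0.50, 0.05):
    print(e, round(ambdex_approx(e, 0.10), 2), round(ambdex_exact(e, 0.10), 2))
```

Because the normal ordinate is symmetric, it makes no difference to either function whether the endorsement frequency is read as the area above or below the boundary. For a balanced item the approximate value falls slightly below the exact one, and for an extreme item slightly above it, in line with the direction of error described above.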

Table 3 lists some representative values of Ambdex for various 
item endorsement frequencies and various percentages of subjects 
changing their responses. 

Rust (1961) reported that Ambdex was highly related to an esti- 
mate of r-tetrachoric in a population of 169 male undergraduates
tested on a 160-item adjective check list at the end of their fresh- 
man and senior years at Yale. N. Wiggins (1961) has shown that 
Ambdex and r-tetrachoric are mathematically related, though not 
identical, statistics. 


TABLE 3
Ambdex as a Function of Endorsement Frequency and Response Instability

Endorsement   Percentage of Subjects Changing Response on Second Administration
Frequency   .00  .01  .02  .03  .04  .05  .06  .07  .08  .09  .10  .15  .20  .25  .30  .35

.50         .00  .02  .05  .08  .10  .12  .15  .18  .20  .23  .25  .38  .50  .63  .75  .88
.45         .00  .03  .05  .08  .10  .13  .15  .18  .20  .23  .25  .38  .51  .63  .76  .88
.40         .00  .03  .05  .08  .10  .13  .16  .18  .21  .23  .25  .39  .52  .65  .78  .91
.35         .00  .03  .05  .08  .11  .14  .16  .19  .22  .24  .27  .41  .54  .68  .81  .95
.30         .00  .03  .06  .09  .12  .14  .17  .20  .23  .26  .29  .43  .58  .72  .87
.25         .00  .03  .06  .09  .13  .16  .19  .22  .25  .28  .31  .47  .63  .79  .94
.20         .00  .04  .07  .11  .14  .18  .21  .25  .29  .32  .36  .54  .71  .89
.15         .00  .04  .09  .13  .17  .22  .26  .30  .34  .39  .43  .64  .86
.10         .00  .05  .11  .17  .23  .29  .34  .40  .46  .51  .57  .86
.09         .00  .06  .12  .19  .25  .31  .37  .43  .49  .56  .62  .93
.08         .00  .07  .14  .20  .27  .34  .41  .47  .54  .61  .68
.07         .00  .08  .15  .22  .30  .37  .45  .52  .60  .67  .75
.06         .00  .08  .17  .25  .34  .42  .50  .59  .67  .76  .84
.05         .00  .10  .19  .29  .39  .49  .58  .68  .78  .87  .97
.04         .00  .12  .23  .35  .47  .58  .70  .81  .93
.03         .00  .15  .29  .44  .59  .74  .88
.02         .00  .21  .42  .63  .83
.00-.01     .00  .37  .74


Figure 5. Index of ambiguity (Ambdex, vertical axis) vs. item endorsement
frequency (horizontal axis) for 82 males administered 160 adjectives.


Since Ambdex was constructed specifically to provide an index of 
ambiguity which would be independent of item endorsement fre- 
quency, a first test of the proposed statistic was to ascertain its 
relationship with item endorsement frequency for various subject
groups and various item pools. Figure 5 shows one scatterplot illus-
trative of the lack of relationship between endorsement frequency
and Ambdex (eta = .01). The data plotted in Figure 5 come from a
double administration of a 160-adjective check list (three-week
test-retest interval) to 82 University of Oregon male undergradu-
ates from an Introductory Psychology course. Note that the attempt
to detach ambiguity from item endorsement frequency appears to
have been successful for this sample. Rust (1961) found both
Ambdex and r-tetrachoric unrelated to item endorsement frequency
in his sample.


The Use of Ambdex for Item Selection 


Personality test development has been marked by diligent interest 
in the principles of scale construction and marked neglect of the 
properties of the stimuli from which the scales are normally con-
structed. It may well be that our most powerful strategies for item
grouping are never able to overcome the defects present in the
initial item sample. Therefore, our best hope for more valid assess-
ment tools may well lie in the increased understanding of basic
item properties.

In the history of test development, item construction has typically
been considered a relatively inconsequential task. Although the
direct investigation of personality test item properties has a long
history (e.g., Bain, 1931; Smith, 1933; Lentz, 1934; Benton, 1935;
Pintner & Forlano, 1938), early investigators, while spotting some
of the more salient problems in item construction and interpretation,
soon reached impasses that discouraged further explorations.

Meanwhile, test developers were relying far too heavily on ra-
tionally-constructed items, building face-valid scales and assuming
their relevance to the measurement task at hand. The wave of pure
empiricism which swept American assessment circles in the decade
following World War II was a healthy force in placing the disci-
pline of personality assessment on firmer methodological grounds.
During this period there emerged explicit rationales for minimizing
the importance of such item properties as item structure and con-
tent. One of the earliest of these rationales was presented by Paul
Meehl in his influential defense of pure empiricism in inventory
construction (Meehl, 1945); a more extreme position with respect to
item content has been taken by Irwin Berg (1959).

The rather basic property of item stability (response consistency
over repeated administrations of the item) seems not to have been
exempted from the general neglect of item characteristics, and the
reliability of the basic units of personality measurement has not
received the systematic attention it deserves. Were one to inquire
as to how to insure a given degree of reliability in a personality
scale (other than simply varying the length of the test), one could
not find a complete answer in the literature.


As was pointed out earlier, investigators in the 1930's were inter-
ested in improving the stability of their scales through an examina-
tion of item properties, but they encountered an apparent enigma,
namely, that the most discriminating items (e.g., between neu-
rotics and normals) were the very items which were responded to
most inconsistently in the general population. That is, they found
that the least stable items were the most useful predictively. In
retrospect, the enigma is not difficult to understand. The items which
were most useful were balanced items, and the items which were
most stable were extreme items. Thus, early investigators (e.g.,
Bain, 1931; Benton, 1935; Frank, 1936; Hertzman & Gould, 1939; 
Lentz, 1934) were faced with the choice of having stable scales 
(composed of extreme items) which were non-differentiating, or 
scales composed of items with some predictive efficiency but low 
temporal stability. 

This paradox may arise primarily from the strong relationship 
between inter-individual and intra-individual variability. Isard 
(1956) has found that the most discriminating forced-choice items 
in predicting college achievement are those whose alternatives elicit 
the most inter-individual variability in their preference ratings. 
Since inter-individual response variability is essential for person- 
ality assessment, it may be necessary to purchase inter-individual
variability in stimulus "meaning" (equivocality) to achieve this 
end. 

In a very significant paper, Broen (1960) specified the conditions 
under which changes in equivocality should lead to increased item 
discriminating power. Normally the test constructor seeks to (a) 
maximize inter-individual response variability (often, as Broen
notes, by sanctioning equivocality) and simultaneously (b) mini- 
mize intra-individual variability of response (and consequently of 
stimulus meaning—i.e., ambiguity). However, since items of high 
equivocality are typically items of high ambiguity, the paradox is 
complete. To assess stable components of personality, it would 
appear necessary either to (a) use the average values of repeated 
measures (an expensive and time-consuming process) or (b) de- 
velop an item technology to the point where items can be con- 
structed which elicit both high inter-individual variability and low 
intra-individual variability.

The search for both stable and discriminating items leads to 
another enigma, namely, that typically the more an item reflects 
some aspect of behavior that is directly observable and easily 
identifiable the more stable is the item; on the other hand, the 
more an item reflects some attitude, value, or other “internal” state 
of mind the more unstable is the item. Consequently, at present, 
inventory constructors could be faced with the dilemma of choosing 
stable items reflecting attributes of relatively little personological 
importance or selecting items of decreased triviality but increased 


intra-individual variability. For this reason, the investigation of
stable biographical inventory questions (e.g., Rust, 1961; Owens, 
Glennon & Albright, 1962) for utilization in personality inventories 
is certainly worth continued effort. 

Another approach lies in the general investigation of item prop- 
erties and the development of objective indices to guide initial item 
selection (prior to actual scale construction). Ambdex is suggested 
as a possible index to be used to minimize item ambiguity. It may 
also be useful to develop an objective index of equivocality (utiliz- 
ing the dispersions of item boundary placements inter-individually)
to provide a parallel measure with Ambdex. As Broen (1960) has 
illustrated, in some prediction situations the test constructor may 
wish to minimize both equivocality and ambiguity, thus loading 
his initial item pool with low Ambdex and low equivocality items. 
For other situations, however, in which increased equivocality is 
useful, one might select for an initial item pool those items having 
high equivocality/Ambdex ratios.

With the exception of Horn (1950), the psychometric literature 
discloses no opponents to decreasing intra-individual variability, at 
least until the hypothesized stability of the personality trait itself 
is attained. Unfortunately, there is less agreement on the extent of 
personality trait variability (e.g., Cattell, 1957; Fiske, 1961; Secord 
& Backman, 1961). A general review of the literature on variability 
can be found in Fiske and Rice (1955). Some important empirical 
findings on the stability of diverse traits over long intervals of 
time (as measured by early psychometric scales) are reported by
Kelly (1956). 

Although there is general concordance on the importance of 
intra-individual response stability in psychometric assessment, the 
traditional neglect of item properties has turned research attention 
away from the systematic investigation of item stability. With one 
very notable exception (Bills, Vance & McLean, 1951) no major
personality inventory has been constructed by explicitly utilizing
item stability as a criterion for item selection. There are, however,
recent indications of a renewed interest in the stimulus character-
istics of items (Owens, Glennon & Albright, 1962; Hanley, 1962;
Edwards & Walsh, 1963) and a trend toward analysis of item
parameters associated with specific response patterns. The breadth
of explanatory power generated by scaling the single item parameter 




of social desirability has been repeatedly documented by Edwards 
(1957). The distinction between content and style (Jackson & Mes-
sick, 1958) and the formulation of item properties as mediators of
components of strategic, method, and stylistic variance in assess-
ment (Wiggins, 1962) are significant steps forward. These recent 
theoretical contributions, taken together with the important demon- 
strations that systematic variation of item properties results in 
systematic variation in response patterns (Buss & Durkee, 1957; 
Buss, 1959; Hanley, 1959; Stricker, 1960; Eliott, 1961; Goldfried & 
McKenzie, 1962; Aiken, 1962), suggest that personality test item 
construction is progressing from an art to a science. The present 
paper aims to channel a part of this trend to the area of item 
stability. 


Summary of the Ambiguity Model 


The preceding discussion has focused exclusively upon the practi- 
cal uses for an ambiguity index. However, considerations of the 
role of response variability and stimulus ambiguity in personality 
assessment also uncover some significant theoretical issues, and the 
ambiguity model provides the potential opportunity for explaining 
a large number of relatively diverse empirical findings. Conse-
quently, the model will now be summarized in its present prelimi- 
nary form, with the hope that it will stimulate independent investi- 
gations which in turn may help clarify the ambiguities now present 
in the ambiguity model. The current model applies solely to mono- 
tone items, scored dichotomously; the more general case is being 
developed. Brief descriptions of the major postulates and theorems 
of the model follow. 

Postulate 1: The closer the perceived position of a person is to 
the boundary established by the item on the relevant attribute con- 
tinuum (as perceived by the person), the more difficult will be his 
decision for that item. Some manifestations of difficulty include: 
(a) response in a middle, or “?,” category if this is allowed in the 
response format (or refusal to respond—leaving the item blank—if 
a middle category is not provided), (b) latency of response (greater 
latency indicating greater difficulty), and (c) ratings of item 
difficulty (or, conversely, ratings of low judgmental confidence). 
Consequently, the following predictions follow from this postu- 
late: 


(P:1a) More items whose boundaries are perceived as close to a
subject's position than items perceived as further away will be
placed in a middle, or “?,” category (if this is allowed in the re-
sponse format) or will be left blank if a middle category is not
provided.

(P:1b) The latency of response should be greater for items whose 
boundaries are perceived as closer to a subject’s position than for 
those perceived as further away. 

(P:1c) A person should report more difficulty in responding (and
consequently less confidence in his response) to items whose bound- 
aries are perceived as closer to himself than to those perceived as 
further away. 

(P:1d) The three manifestations of difficulty should be inter- 
related. 

Postulate 2: The closer the perceived position of a person is to 
the boundary established by the item on the relevant attribute con- 
tinuum (as perceived by the person), the more unstable will be his 
response. 

Theorem 1: Combining Postulates (1) and (2): 

The more difficult is a person’s response to an item upon its first 

administration, the more likely is he to change that response upon 
retesting. Consequently: 

(T:1a) Items placed in a “?” category (or left blank) will be
relatively unstable. 

(T:1b) Items which elicit long response latencies will be rela- 
tively unstable. К 

(T:1c) Items rated difficult (or given low confidence ratings) 
will be relatively unstable. 

Postulate 3: The distribution of subjects on the attribute con- 
tinuum is specified. In practice, subjects are assumed to be dis- 
tributed in approximately normal curve frequencies on most psycho- 
logical attribute continua. 

Theorem 2: From Postulate (3): 

Fewer persons will see their own positions as falling close to an
extreme item's boundary than to a more balanced item.

Theorem 3: Combining Postulate (1) and Theorem (2):

Items of extreme endorsement frequencies, as compared to more
balanced items, will be seen by fewer individuals as difficult items.
Consequently:

(T:3a) Balanced items will be placed in a “?” category, or will
be left blank, by more individuals than will extreme items.

(T:3b) Balanced items will elicit longer response latencies.

(T:3c) Balanced items will be judged more difficult (or given
lower confidence ratings).

Theorem 4: Combining Postulate (2) and Theorem (3):

Items of extreme endorsement frequencies, as compared to more
balanced items, will be relatively stable.

Postulate 4: The greater the inter-individual variability in the
positioning of an item's boundary on a relevant attribute continuum
(i.e., the more disagreement among persons as to the position of
the boundary of an item on the attribute continuum), the greater
will be the intra-individual variability of item positioning over
time (i.e., the more likely will individuals be to vary in their
positioning of the item's boundary upon repeated administrations
of the item). That is, equivocality is related to ambiguity. Since
manifestations of intra-individual variability in boundary positioning
should include both subjective ratings of ambiguity and
an objective index of ambiguity (Ambdex), consequently:

(P:4a) Intra-individual variability in item boundary positioning
will be directly related to ratings of item ambiguity.

(P:4b) Intra-individual variability in item boundary positioning
will be directly related to Ambdex values.

(P:4c) Ratings of item ambiguity will be related to Ambdex
values.

(P:4d) Ratings of item ambiguity will be correlated with item
equivocality.

(P:4e) Ambdex values will be correlated with item equivocality.

Postulate 5: The more intra-individual variability in the
positioning of an item's boundary, the more unstable will be the response
to that item.

(P:5a) Response instability will be correlated with ambiguity
ratings.

(P:5b) Response instability will be correlated with Ambdex
values.

Theorem 5: Combining Postulate (4) and Postulate (5):

The more inter-individual variability in the positioning of an
item's boundary, the more unstable will be the response to that
item.

In summary, response stability can be considered as a positive
function of (a) self-item distance (for an individual) and (b) item 
extremeness (for a group), and a negative function of (c) item 
ambiguity, (d) item difficulty, and (e) item equivocality. 

Table 4 summarizes the postulates and theorems of the model 
and indicates those which have received empirical test. 
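
Two of the group-level statistics in this summary can be illustrated with a short Python sketch. It computes an item's endorsement frequency p, the index |.5 − p| that Table 4 labels item balance (0 for a perfectly balanced item, .5 for a maximally extreme one), and response instability as the proportion of subjects who change their response on retest. The function names and the response vectors are hypothetical and are not taken from any of the studies discussed here.

```python
def endorsement_frequency(responses):
    """Proportion of subjects endorsing the item (1 = endorsed, 0 = not)."""
    return sum(responses) / len(responses)


def item_balance(responses):
    """|.5 - p|: 0 for a perfectly balanced item, .5 for a maximally extreme one."""
    return abs(0.5 - endorsement_frequency(responses))


def response_instability(test, retest):
    """Proportion of subjects whose response differs between test and retest."""
    if len(test) != len(retest):
        raise ValueError("test and retest must cover the same subjects")
    changed = sum(1 for first, second in zip(test, retest) if first != second)
    return changed / len(test)


# Hypothetical responses of ten subjects to two items (illustration only).
balanced_test = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
balanced_retest = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
extreme_test = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
extreme_retest = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

for name, test, retest in [
    ("balanced", balanced_test, balanced_retest),
    ("extreme", extreme_test, extreme_retest),
]:
    print(name, item_balance(test), response_instability(test, retest))
```

In this invented example the balanced item changes for four subjects in ten and the extreme item for one in ten, which is the direction Theorem 4 predicts; real data would of course have to be examined item by item.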

Evidence for some of these assumptions can be found in a recent
study by Hanley (1962). Using 75 MMPI items, Hanley demonstrated
that a measure of response latency and a measure of
judgmental confidence were both negatively correlated with
extremeness of item endorsement frequency (i.e., latency as well as
difficulty were inversely related to item balance). Moreover, Hanley
showed that item length, which Strong (1962) found to be related
to ambiguity in SVIB items, was also highly related to response
latency and difficulty.

A very recent study by Edwards and Walsh (1963) provides 
additional support for the model. When 176 miscellaneous person- 
ality statements were administered twice to 110 male and 111 fe- 
male college students, item stability was found to be related to the 
extremeness of item positioning on the social desirability dimension 
(as independently assessed by ratings from 47 male and 48 female 
college students) and item stability was negatively related to the 
variance of these social desirability ratings. Moreover, this study 
replicated that of Dodd and Svalastoga (1952) in demonstrating 
that item stability was inversely related to the percentage of sub- 
jects utilizing a middle, or “?,” category when this response option 
is permitted. In Dodd and Svalastoga's study, 271 respondents in a
poll on national affairs replied by mail to a second poll in which
seven questions from the original poll were embedded. For these
seven repeated questions, a correlation of .91 was found between
the percentage of “don't know” responses on the first poll and the
percentage of responses changed between poll and re-poll. Another
demonstration of the same theorem has been provided in a study in
which consistency of response to an 11-item attitude scale was
shown to be inversely related to response latency (Crockett, Bates, &
Caylor, 1958).

Postulate (4) of the ambiguity model states that inter-individual 
variation in the positioning of an item's boundary should be directly 
related to the average intra-individual variation of boundary posi- 




TABLE 4
Postulates and Theorems of the Ambiguity Model

Postulate (P)                     Formula
or Theorem (T)   Individual Case     Group Case               Empirical Test
P:1              D_ij = f(d_ij)      D_.j = f(d_.j)
P:1a                                 O_.j = f(d_.j)
P:1b             L_ij = f(d_ij)      L_.j = f(d_.j)
P:1c             R_ij = f(d_ij)      R_.j = f(d_.j)
P:1d                                 O_.j = f(L_.j)
                 L_ij = f(R_ij)      L_.j = f(R_.j)
                                     O_.j = f(R_.j)
P:2              I_i. = f(d_i.)      I_.. = f(d_..)
T:1              I_i. = f(D_i.)      I_.. = f(D_..)
T:1a             I_i. = f(O_i.)      I_.. = f(O_..)           Dodd & Svalastoga (1952);
                                                              Edwards & Walsh (1963)
T:1b             I_i. = f(L_i.)      I_.. = f(L_..)           Crockett, Bates, & Caylor (1958)
T:1c             I_i. = f(R_i.)      I_.. = f(R_..)
P:3              Specification of distribution of individuals on underlying attribute
                 continuum (here assumed to be the normal distribution)
T:2                                  d_.j = f(|.5 − p|)_.j
T:3                                  D_.j = f(|.5 − p|)_.j
T:3a                                 O_.j = f(|.5 − p|)_.j
T:3b                                 L_.j = f(|.5 − p|)_.j    Hanley (1962)
T:3c                                 R_.j = f(|.5 − p|)_.j    Hanley (1962)
T:4                                  I_.. = f(|.5 − p|)_..    Edwards & Walsh (1963)
РА S. = 157) Goldberg (present report 
P:4a a.; = f(S.) 
P:4b A... = (85; 
Р:4с a.. = f(A.) 
P:4d a., = f(5;.) 
P:4e A.. = {(S.) 
P:5 L. = f(8.) Edwards & Walsh (1963) 
P:5a I.. = f(a..) 
P:5b I.. = f(A..) 
T:5 L. = f(Si) 
Notation: i = individuals; j = occasions; k = items (omitted from the notation) 


d = self-item distance
p = endorsement frequency (proportion); |.5 − p| = item balance
I = response instability (proportion of responses changed)
S = dispersion of item boundary positions
L = response latency
O = percentage of individuals omitting item (or placing item in “?” category)
R = “difficulty” ratings
a = “ambiguity” ratings
A = Ambdex values
[A bar over a symbol indicates a mean value taken over the indicated subscript.]

tioning over repeated administrations of the item. Evidence
supporting this assumption can be seen in Figure 6.
The data plotted in Figure 6 come from a study in which 160 


[Figure 6. Average inter-individual variation (vertical axis) vs. average
intra-individual variation over time (horizontal axis) for ten subjects
positioning 160 adjectives on a social desirability continuum over ten
administrations.]


adjectives were administered on ten occasions (approximately one- 
week interval between administrations) to ten University of Oregon 
undergraduates; the task of the subjects was to position each ad- 
jective on the underlying attribute continuum of “Social Desir- 
ability,” defined for them in approximately the same manner as 
that used by Edwards (1957). The correlation between variation
in item positioning inter-individually and variation in item
positioning intra-individually (over time) was .52.
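
The two quantities correlated in this study can be sketched in Python as follows. The report does not give the formulas, so the sketch assumes that variation is measured by the standard deviation: inter-individual variation for an adjective is taken as the spread across subjects within an occasion, averaged over occasions, and intra-individual variation as the spread across occasions within a subject, averaged over subjects. The function names and the tiny rating arrays are hypothetical; the actual study used ten subjects, ten occasions, and 160 adjectives.

```python
from math import sqrt
from statistics import mean, pstdev


def inter_individual_variation(ratings):
    """ratings[i][j] is the position subject i gave the item on occasion j.

    Spread across subjects within each occasion, averaged over occasions.
    """
    n_occasions = len(ratings[0])
    per_occasion = [pstdev([row[j] for row in ratings]) for j in range(n_occasions)]
    return mean(per_occasion)


def intra_individual_variation(ratings):
    """Spread across occasions within each subject, averaged over subjects."""
    return mean([pstdev(row) for row in ratings])


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings for three adjectives: 2 subjects x 3 occasions each.
items = {
    "agreed_and_steady": [[5, 5, 5], [5, 5, 5]],
    "disputed_but_steady": [[1, 1, 1], [3, 3, 3]],
    "agreed_but_shifting": [[1, 3, 1], [1, 3, 1]],
}

inter = [inter_individual_variation(r) for r in items.values()]
intra = [intra_individual_variation(r) for r in items.values()]
print(pearson_r(inter, intra))
```

With a full data set, `pearson_r(inter, intra)` computed over all adjectives would correspond to the kind of coefficient reported above, subject to the assumption about how variation was measured.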


Summary 


A major stumbling block to improved personality scales may
arise from defects in the initial item pool utilized for scale
construction. One of the most important defects of both early and recent
personality test items may be their ambiguity, here considered as a
property of the item which tends to elicit marked intra-individual
variability in response over repeated test administrations.
Conventional measures of item instability (i.e., the percentage of subjects
changing their responses to an item upon its repeated administration)
have been shown to be closely related to item endorsement
frequency and thus can be shown to have rather severe limitations
as indices of ambiguity. This paper attempts to integrate existing
knowledge concerning item instability, and a model of item
ambiguity and response instability is presented. An item statistic,
Ambdex, derived from the model is proposed as a preliminary index
of item ambiguity, and its use as an item selection statistic is
discussed.


REFERENCES 


Aiken, L. R., Jr. “Frequency and Intensity as Psychometric
Response Variables.” Psychological Reports, XI (1962), 535-538.

Bain, R. “Stability in Questionnaire Response.” American Journal
of Sociology, XXXVII (1931), 445-453.

Benton, A. L. “The Interpretation of Questionnaire Items in a
Personality Schedule.” Archives of Psychology, XC (1935), #190.

Benton, A. L. and Stone, I. R. “Consistency of Response to
Personality Inventory Items as a Function of Length of Interval
between Test and Retest.” Journal of Social Psychology, VIII
(1937), 143-146.

Berg, I. A. “The Unimportance of Test Item Content.” In B. M.
Bass and I. A. Berg (Eds.) Objective Approaches to Personality
Assessment. New York: Van Nostrand, 1959, 83-99.

Bills, R. E., Vance, E. L., and McLean, O. S. “An Index of
Adjustment and Values.” Journal of Consulting Psychology, XV
(1951), 257-261.

Broen, W. E., Jr. “Ambiguity and Discriminating Power in
Personality Inventories.” Journal of Consulting Psychology, XXIV
(1960), 174-179.

Buss, A. H. “The Effect of Item Style on Social Desirability and
Frequency of Endorsement.” Journal of Consulting Psychology,
XXIII (1959), 510-513.

Buss, A. H. and Durkee, Ann. “An Inventory for Assessing Different
Kinds of Hostility.” Journal of Consulting Psychology, XXI
(1957), 343-349.

Cattell, R. B. Personality and Motivation: Structure and
Measurement. New York: World Book Company, 1957, 589-631.

Chance, June E. “Prediction of Changes in a Personality Inventory
on Retesting.” Psychological Reports, I (1955), 383-387.

Coombs, C. H. “A Theory of Psychological Scaling.” University
of Michigan Bulletin, No. 34, Engineering Research Institute,
1952.

Crockett, W. H., Bates, C., Jr., and Caylor, J. S. “Intra-judge
Consistency and Inter-judge Agreement in Responses to Attitude
Scale Items.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT,
XVIII (1958), 597-605.

Dodd, S. C. and Svalastoga, K. “On Estimating Latent from
Manifest Undecidedness: The ‘Don't Know’ Percent as a Warning
of Instability among the Knowers.” EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT, XII (1952), 467-471.

Edwards, A. L. The Social Desirability Variable in Personality
Assessment and Research. New York: Dryden Press, 1957.

Edwards, A. L. and Walsh, J. A. “Relationships between Various
Psychometric Properties of Personality Items.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, XXIII (1963), 227-238.

Eisenberg, P. and Wesman, A. G. “Consistency in Response and
Logical Interpretation of Psychoneurotic Inventory Items.”
Journal of Educational Psychology, XXXII (1941), 321-338.

Elliott, L. L. “Effects of Item Construction and Respondent
Aptitude on Response Acquiescence.” EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT, XXI (1961), 405-416.

Farnsworth, P. R. “A Genetic Study of the Bernreuter Personality
Inventory.” Journal of Genetic Psychology, LII (1938), 8-13.

Fiske, D. W. “The Constraints on Intra-individual Variability in
Test Responses.” EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, XVII (1957), 317-337.

Fiske, D. W. “The Inherent Variability of Behavior.” In D. W.
Fiske and S. R. Maddi (Eds.) Functions of Varied Experience.
Homewood: Dorsey Press, 1961.

Fiske, D. W. and Rice, L. “Intra-individual Response Variability.”
Psychological Bulletin, LII (1955), 217-250.

Frank, B. “Stability of Questionnaire Response.” Journal of
Abnormal and Social Psychology, XXX (1936), 320-324.

Glaser, R. “A Methodological Analysis of the Inconsistency of
Response to Test Items.” EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, IX (1949), 727-739.

Goldfried, M. R. and McKenzie, J. D., Jr. “Sex Differences in the
Effect of Item Style on Social Desirability and Frequency of
Endorsement.” Journal of Consulting Psychology, XXVI (1962),
126-128.

Green, B. F. “Attitude Measurement.” In G. Lindzey (Ed.)
Handbook of Social Psychology. New York: Addison-Wesley, 1954.

Hanley, C. “Responses to the Wording of Personality Test Items.”
Journal of Consulting Psychology, XXIII (1959).

Hanley, C. “The ‘Difficulty’ of a Personality Inventory Item.”
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, XXII (1962),
571-584.

Hertzman, M. and Gould, R. “The Functional Significance of
Changed Responses in a Psychoneurotic Inventory.” Journal of
Abnormal and Social Psychology, XXXIV (1939), 336-350.

Horn, D. “Intra-individual Variability in the Study of Personality.”
Journal of Clinical Psychology, VI (1950), 43-47.

Isard, E. S. “The Relationship between Item Ambiguity and
Discriminating Power in a Forced-Choice Scale.” Journal of Applied
Psychology, XL (1956), 266-268.

Jackson, D. N. and Messick, S. “Content and Style in Personality
Assessment.” Psychological Bulletin, LV (1958), 243-252.

Jackson, J. M. “The Stability of Guilford-Zimmerman Personality
Measures.” Journal of Applied Psychology, XLV (1961), 431-434.

Kelly, E. L. “Consistency of the Adult Personality.” American
Psychologist, XI (1956), 659-681.

Lentz, T. F. “Reliability of the Opinionaire Technique Studied
Intensively by the Retest Method.” Journal of Social Psychology,
V (1934), 338-364.

Meehl, P. E. “The ‘Dynamics’ of Structured Personality Tests.”
Journal of Clinical Psychology, I (1945), 296-303.

Mitra, S. K. and Fiske, D. W. “Intra-individual Variability as
Related to Test Score and Item.” EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, XVI (1956), 3-12.

Neprash, J. A. “The Reliability of Questions in the Thurstone
Personality Schedule.” Journal of Social Psychology, VII (1936),
239-244.

Owens, W. A., Glennon, J. R. and Albright, L. E. “Retest
Consistency and the Writing of Life History Items: A First Step.”
Journal of Applied Psychology, XLVI (1962), 329-331.

Pintner, R. and Forlano, G. “Four Retests of a Personality
Inventory.” Journal of Educational Psychology, XXIX (1938), 93-

Rust, R. M. “A Comparison of Statistical Indices of Item Stability
and Subject Stability.” Paper read at Western Psychological
Association meetings, 1961.

Secord, P. F. and Backman, C. W. “Personality Theory and the
Problem of Stability and Change in Individual Behavior: An
Interpersonal Approach.” Psychological Review, LXVIII (1961).

Smith, M. “A Note on Stability in Questionnaire Response.”
American Journal of Sociology, XXXVIII (1933), 713-720.

Stricker, L. J. “Some Item Characteristics that Evoke Acquiescent
and Social Desirability Response Sets on Psychological Scales.”
Unpublished Ph.D. thesis, New York University, 1960.

Strong, E. K., Jr. “Good and Poor Interest Test Items.” Journal
of Applied Psychology, XLVI (1962), 269-275.

Wiggins, J. S. “Strategic, Method and Stylistic Variance in the
MMPI.” Psychological Bulletin, LIX (1962), 224-242.

Wiggins, Nancy. “On the Mathematical Relationship between
Ambdex and r-Tetrachoric.” Unpublished paper, 1961.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 3, 1963 


THE MEASUREMENT OF SELF CONCEPT 
AND SELF REPORT 


ARTHUR W. COMBS, DANIEL W. SOPER
AND CLIFFORD C. COURSON


University of Florida 


It is probable that no area of psychological research is currently
more popular or more confused than that having to do with the
measurement of the self concept. In a recent review Wylie (1961)
lists 493 articles and research reports on the self concept. An
examination of these reports is a bewildering experience, however, for it
quickly becomes apparent that most studies purporting to explore
the self concept are, in fact, not measures of the self concept at all.
They are studies of the self report! (See, for example, Balester,
1956; Benjamins, 1952; Bice, 1954; Bills, 1954; Bills, Vance &
McLean, 1951; Brownfain, 1952; Calvin & Holtzman, 1953; Epstein,
1955; Fitts, 1954; Hartley, 1951.) In view of the great current
interest in “self” psychology, this confusion seems most unfortunate
(Gordon & Combs, 1958). If the self report and the self concept are
truly different psychological constructs, confusion over what is really
being measured can only result in chaos in the literature with some
things seemingly proven that really are not so and others rejected
as false that really are true.

Combs and Soper (1957), in an earlier article on the self and its
derivate terms, called attention to this confusion and attempted to
define the two constructs with greater clarity. They pointed out that
the self concept and self report are quite different concepts which
can by no means be used interchangeably. A number of their
colleagues, however, have argued the point, claiming the self report is
a valid indication of the self concept and have so used it in further
researches. This present study is an attempt to shed further light on
the relationship of self concept and self report and to highlight some
of the problems of measurement involved in their exploration.


The Problem 


The “self concept,” as it is generally defined, is the organization 
of all that seems to the individual to be “I” or “me” (Combs & 
Snygg, 1959). It is what an individual believes about himself; the 
totality of his ways of seeing himself. On the other hand, the “self 
report” is a description of self reported to an outsider. It represents 
what the individual says he is. To be sure, what an individual says 
of himself will be affected by his self concept. This relationship, 
however, is not a one-to-one relationship. The self report will rarely, 
if ever, be identical with the self concept. The self report is essen- 
tially an introspection and is no more acceptable as direct evidence 
of causation in modern phenomenological psychology than in earlier, 
more traditional schools of thought (Bakan, 1954; Boring, 1953). 

How closely the self report approximates the subject's "real" self 
concept will presumably depend upon at least the following factors: 


1. The clarity of the individual's awareness. 

2. The availability of adequate symbols for expression. 

3. The willingness of the individual to cooperate. 

4. The social expectancy.

5. The individual's feeling of personal adequacy. 

6. His feeling of freedom from threat (Combs & Snygg, 1959; 
Combs & Soper, 1957). 


Clearly, then, on logical grounds the self report cannot be used
as a direct measure of the self concept. Yet this is precisely what a
large number of studies in the current literature have done. How
then can we get closer to the self concept?

Since the self concept is an organization within the individual's
perceptual or phenomenal field, it is not open to direct observation.
To study the self concept it is necessary to infer its nature from
observations of the behavior of the individual. One class of
behaviors which may be used as a basis for making such inferences, of
course, is what the subject has to say about himself. The probability
of accurate inference, however, will be greatly increased if a larger
sample of behavior is used as data from which inferences are made.
One way of doing this is to use trained observers who a) make
careful observations of a subject under a variety of circumstances,
and then b) infer the nature of the individual's ways of perceiving
himself and his world. Combs and Soper (1957) have called this
“the use of the observer as instrument.”

The “inferred self concept” obtained in this fashion is based upon
the assumption that, if behavior is a function of perception, it
should be possible to observe behavior and infer the nature of the
self perceptions which produced that behavior. This “reading
behavior backward” escapes most of the sources of error indicated
above for the introspective self report but, of course, does not have
a perfect relationship to the self concept either. Such inferences are
the stock in trade of the clinical psychologist. Their accuracy as
self concept descriptions will be dependent upon the sensitivity and
skill of the observer. Such skill and sensitivity, however, are subject
to training and so lie far more within the control of the experimenter
than does the subject's self report. The inferred self concept escapes
most of the obvious sources of error affecting the self report. We
may therefore presume on logical grounds that the inferred self
concept is probably a much more accurate measure of the self
concept than the subject's self report.


Is the present use of the self report as a direct measure of the
self concept in the current literature justified? If it is, then the
correlation between the inferred self concept and self report should
be high. However, if the self report is quite different from the self
concept, then the correlation between self report and self concept
should be low or nonexistent. This experiment was designed to test
which of these conditions exists. The hypothesis was stated as a
prediction, as follows: There will be no significant relationship
between the inferred self concepts of children obtained from
observation of their behavior and the self reports obtained directly from
the children. To test this prediction it was planned to have trained
observers infer the self concepts of children, and to ask children to
rate themselves on the same instrument.

The Inferred Self Concept—Self Report Scale

To test the hypothesis a very simple instrument was needed on
which children could record their self reports and which could also
be used by the research team to record inferences about the child's
self concept. The authors, therefore, constructed a simple scale
consisting of eighteen pairs of positive and negative statements about
self, arranged on a five-point scale.

Items

1. People like to have me around. . . . People don't care if I'm
there or not.
2. I feel sad a lot. . . . I feel happy most of the time.
3. People don't like me very well. . . . People usually like me.
4. I look pretty good. . . . I don't like my looks.
5. I usually feel bad, get tired quickly. . . . I usually feel good,
have lots of energy.
6. Teachers like me pretty well. . . . Teachers don't like me
much.
7. Mostly I feel sure of myself. . . . I'm not very sure of myself.
8. Kids don't like me much. . . . Kids like me pretty well.
9. I feel worthwhile. . . . I don't amount to much.
10. Things are often too much for me to deal with. . . . I can
handle things pretty well.
11. I don't think I'm brave. . . . I think I'm a brave person.
12. I'm doing very badly in school. . . . I'm doing okay in school.
13. I don't think I'm smart. . . . I think I am pretty smart.
14. I'm really a nice person. . . . I'm not a nice person.
15. Most school work is fun. . . . School work is not much fun.
16. I'm not much good at sports and games. . . . I'm good at
sports and games.
17. People think I'm important. . . . People don't think I'm very
important.
18. It's fun being me. . . . I wish I were somebody else.


The scale items were randomized such that the positive and negative
statements did not always run in the same direction. No labels or
descriptions were placed on the five-point digit spread separating
the positive and negative statement so that the values invested in
the scale would be more likely to be personal than externally
suggested.
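
Because the positive statement appears first in some pairs and second in others, circled digits would have to be brought to a common direction before any total or cross-item comparison could be formed. The Python sketch below shows one way this could be done. It assumes, since the article does not say, that the digit 1 lies nearest the first (left-hand) statement and 5 nearest the second; the set of positive-first items is read from the item list above, and the function name is hypothetical. The item-by-item correlations reported later do not require this step.

```python
# Items whose POSITIVE statement is printed first in the item list above.
POSITIVE_FIRST = frozenset({1, 4, 6, 7, 9, 14, 15, 17, 18})


def favorable_score(item, circled, points=5):
    """Return a rating in which a higher number always means a more positive view.

    Assumes digit 1 sits next to the first statement of the pair and
    digit `points` next to the second (an assumption, not stated in the source).
    """
    if not 1 <= item <= 18:
        raise ValueError("item number must be between 1 and 18")
    if not 1 <= circled <= points:
        raise ValueError("circled digit is outside the scale")
    if item in POSITIVE_FIRST:
        # Positive statement is on the low-digit side, so reverse the digit.
        return points + 1 - circled
    return circled


# A child circling 1 on item 1 ("People like to have me around") and
# 5 on item 2 ("I feel happy most of the time") is equally favorable on both.
print(favorable_score(1, 1), favorable_score(2, 5))
```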


The Self Report Data 


The Self Concept—Self Report Scale (SC-SR) was administered
to all 59 pupils in the sixth grade at the P. K. Yonge Laboratory
School at the University of Florida. It was administered to the
class in two groups with the following directions:


This is a test to see how a person describes himself. Read
each sentence carefully. Rate each sentence according to the
way it best fits you as a person. There are five ways you can
rate the sentence. Each of these five ways is described by a
number. Circle the number that best describes how the sentence
fits you. Be sure to complete the ratings for each sentence.


The results of this testing constituted the “self report” data of this 
investigation. 

No attempt was made to test the reliability of the children’s 
statements. The statements of the children were accepted as reliable 
per se. The question of this research is to determine if children’s 
statements about self are comparable to inferred self concept ratings. 
If children’s self reports are not reliable, then this, or any other, 
research on the topic is futile. We have, therefore, accepted the 
children’s ratings as honest, accurate statements of their attitudes. 


The Self Concept Inferences

This study was made possible by the happy circumstance of finding
ourselves with an experienced research team from another, much
larger, study. The four persons on this team had been carefully
selected and trained to make self concept inferences of school
children from repeated observations of their behavior under both free
and controlled conditions. At the time of this research they had
worked for an academic year in this way and there was statistical
evidence of the reliability of their ratings on this kind of task.

Our trained research team also used the SC-SR scale to report
their inferences about the children's self concepts after careful
observation of each subject. Each subject was observed in the
classroom through a one-way vision screen for approximately thirty
minutes. A second half hour observation was made outside the
classroom, usually on the playground. A third half hour was spent in a
“picture story test” interview in which each child was shown
ambiguous drawings depicting home, school, and other interpersonal
situations. Subjects were directed to tell a story about each of the
pictures and the ensuing interviews were flexible and open-ended,
following the lead of the child or the examiner's interest. From the
total hour and a half of observations and interview, an inference as
to the nature of the child's perceptions of self was scored on the
SC-SR scale by the research team member. In this manner, inferred
self concept ratings were obtained for each subject. These were
then compared with the self reports obtained from the children.

——
The authors are deeply grateful for the devoted service of Kay Chamber…
Clark, Evelyn … James Fisher, and Michael Whetstone. Without
their efforts this joint project would not have been possible.




Results 


To determine the relationship of self report to inferred self
concept, Pearson r's were computed for each of the eighteen items of
the SC-SR scale. Each self rating was compared with the
corresponding inferred rating for all 59 children. These correlation
coefficients are reported in Table 1. It will be noted that they range
from −.199 to +.336. To arrive at an over-all coefficient, the
eighteen correlations were converted to z-scores and averaged. This
gave a mean correlation of .114, which was not statistically
significant. These results strongly indicate that there is no significant
relationship between the inferred self concepts of these children and
their self reports.
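
The averaging step described here can be sketched in Python, on the assumption that “converted to z-scores” refers to Fisher's r-to-z transformation (the inverse hyperbolic tangent), which is the usual way of averaging correlation coefficients. The eighteen values are those printed in Table 1, with the entry for item 16 read as +.245; the function name is hypothetical.

```python
from math import atanh, tanh

# Item correlations from Table 1, items 1 through 18 (item 16 read as +.245).
ITEM_CORRELATIONS = [
    0.123, -0.019, 0.164, 0.336, 0.083, 0.149, 0.249, 0.285, 0.089,
    -0.115, -0.199, 0.135, 0.060, 0.151, 0.120, 0.245, 0.142, 0.026,
]


def mean_correlation_via_z(correlations):
    """Average correlations by way of Fisher's z.

    Each r is transformed to z = atanh(r), the z values are averaged,
    and the mean z is transformed back to the r metric with tanh.
    """
    z_values = [atanh(r) for r in correlations]
    mean_z = sum(z_values) / len(z_values)
    return tanh(mean_z)


print(round(mean_correlation_via_z(ITEM_CORRELATIONS), 3))
```

Under this reading the back-transformed mean agrees with the reported .114 to three decimals, which also supports reading the item 16 entry as positive.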

The results of this study appear to support the theoretical
position that the self concept and the self report are quite different
concepts. Though they may bear some relationship to each other,
they can certainly not be used interchangeably as personality
measures. As we have suggested earlier, the confusion of these constructs
in the design of research must necessarily lead to confusion in the
interpretation of research results. To be sure, the self report is a
much simpler and more direct device for measurement than the
inferential maneuvers required for studying the self concept. It lends
itself to accustomed designs and comfortable statistical techniques.
The most beautiful research design and the most highly significant
statistics, however, may only obscure understanding if basic
conceptualization is not accurate. Self report studies are valuable in
their own right. We need such information. But when such
experiments masquerade as self concept studies the damage they can do is
very great; valid theory may be disproven, for example, while false
assumptions are given the support of “scientific proof.” Unhappily,
this seems to be the very situation we confront. A very large
number of studies in the literature, ostensibly concerned with the self
concept, actually turn out on closer examination to be studies of
self report. A review of research for this article, for example,
included more than fifty. This is a great waste. It is our hope that
this study may at least serve to call attention to the problem and
so prevent further loss of valuable research effort in our profession.


TABLE 1

Pearson Correlations Between Self Report and Inferred
Self Concept Ratings on Eighteen Test Items
(N = 59)

Item No.      r        Item No.      r
   1.       +.123        10.       −.115
   2.       −.019        11.       −.199
   3.       +.164        12.       +.135
   4.       +.336        13.       +.060
   5.       +.083        14.       +.151
   6.       +.149        15.       +.120
   7.       +.249        16.       +.245
   8.       +.285        17.       +.142
   9.       +.089        18.       +.026

Mean        +.114


Summary 


This study was designed to test whether the self report can justi- 
fiably be used as a direct measure of the self concept. It was 
predicted that children’s self reports would show no significant rela- 
tionship to self concept inferences made by trained observers. Fifty- 
nine sixth grade children reported their feelings about themselves 
on eighteen items of a specially-prepared self perception report 
sheet. The average correlation of the two kinds of ratings was .11,
indicating no significant relationship as predicted.

The authors relate these findings to widespread confusion in the 
use of these concepts in much current literature. 


REFERENCES

Bakan, D. “A Reconsideration of the Problem of Introspection.”
Psychological Bulletin, LI (1954), 105-118.

Balester, R. J. “The Self Concept and Juvenile Delinquency.”
Unpublished Ph.D. thesis, Vanderbilt University, 1956.

Benjamins, J. “Changes in Performance in Relation to Influences
Upon Self Conceptualization.” Journal of Abnormal and Social
Psychology, XLV (1952), 473-480.

Bice, H. V. “Some Factors that Contribute to the Concept of Self in
the Child with Cerebral Palsy.” Mental Hygiene, XXXVIII
(1954), 120-131.

Bills, R. E. “Self Concepts and Rorschach Signs of Depression.”
Journal of Consulting Psychology, XVIII (1954), 135-137.

Bills, R. E., Vance, E. L. and McLean, O. S. “An Index of
Adjustment and Values.” Journal of Consulting Psychology, XV (1951),
257-261.

Boring, E. G. “A History of Introspection.” Psychological Bulletin,
L (1953), 169-186.

Brownfain, J. J. “Stability of the Self Concept as a Dimension
of Personality.” Journal of Abnormal and Social Psychology,
XLVII (1952), 597-606.

Calvin, A. D. and Holtzman, W. H. “Adjustment and the
Discrepancy Between Self Concept and Inferred Self.” Journal of
Consulting Psychology, XVII (1953), 39-44.

Combs, A. W. and Snygg, D. Individual Behavior (Revised
Edition). New York: Harper & Brothers, 1959.

Combs, A. W. and Soper, D. W. “The Self, Its Derivate Terms,
and Research.” Journal of Individual Psychology, XIII (1957).

Epstein, S. “Unconscious Self Evaluation in a Normal and
Schizophrenic Group.” Journal of Abnormal and Social Psychology,
L (1955), 65-70.

Fitts, H. “The Role of the Self Concept in Social Perception.”
Unpublished Ph.D. thesis, Vanderbilt University, 1954.

Gordon, I. J. and Combs, A. W. “The Learner: Self and
Perception.” Review of Educational Research, XXVIII (1958), 433-

Hartley, Margaret. “Changes in the Self Concept During
Psychotherapy.” Unpublished Ph.D. thesis, University of Chicago, 1951.

Siegel, S. Nonparametric Statistics. New York: McGraw-Hill, 1956.

Taylor, C. and Combs, A. W. “Self Acceptance and Adjustment.”
Journal of Consulting Psychology, XVI (1952), 89-91.

Wylie, Ruth C. The Self Concept. Nebraska: University of Nebraska
Press, 1961.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol XXIII, No. 3, 1963 


MULTIPLE SCALOGRAM ANALYSIS: A SET-THEORETIC
MODEL FOR ANALYZING DICHOTOMOUS ITEMS¹


JAMES C. LINGOES 
The University of Michigan 


Among the various criticisms that have been directed against
the scaling technique of Guttman (1944), a major one has been
in reference to his concept of a "universe of content" (see, for 
example, Festinger [1947] and Loevinger [1948]). This particular 
concept lies at the basis for item construction and selection in Gutt- 
man’s scalogram method. Quite generally what is meant by this 
phrase is the set of all statements which may be made in reference 
to a single variable or trait, as for example, “love of country,” 
"morale," “motherliness,” etc. The crucial part of this broad defi- 
nition rests in the undefined term “single.” The lack of a clear 
meaning for the important concept of a “universe of content” and
the absence of definitive rules for selecting items relevant to the 
universe have presented both conceptual and practical problems 
standing in the way of a fuller acceptance and a wider application 
of Guttman’s scale analysis. 

Another, and equally important, criticism has been directed 
against the criterion of reproducibility (Festinger, 1947). It would 
be desirable to have something more than a rule of thumb for 
guarding against spuriously high reproducibilities as a function 
of extreme marginal values. 

¹ The author wishes to express his appreciation to Professors C. F. Wrigley,
C. H. Coombs, and J. E. Milholland for their helpful comments and criticisms
in the preparation of this paper. He would also like to acknowledge the
cooperation of Professor R. C. F. Bartels for making the IBM 704/709/7090
computer facilities freely available for the development and testing of the
method.




This paper will present a completely objective and empirical
procedure for selecting dichotomous items which meet the Guttman
scaling criteria. The data are, in effect, permitted to “speak for
themselves," without imposing restrictions in advance of our ex- 
ploring just what, if any, universes or domains are involved. We 
wish to propose a method of analysis which will extend Guttman's 
method to the determination of multiple dimensions for dichotomous 
variables. 


The MSA Method 


In brief, the Multiple Scalogram Analysis (MSA) method in- 
volves selecting an item from the set to be analyzed, finding that 
item among the remaining items which is most like it and having 
the fewest errors, determining the number of errors between the 
candidate item and all of its predecessors, and, finally, applying a
statistical test of significance to adjacent item pairs. If both the
error and statistical criteria are satisfied, then the item that last
entered the scale is used to find an item most like it, etc. Whenever 
either the error or statistical criterion fails, however, the scale is 
terminated and another scale is started with a new item chosen 
from among those that remain, until that point is reached where 
the item set is exhausted. All items are forced into a positive
manifold and monotonicity of item marginals is insisted upon. Once an
item enters a scale, it is no longer considered for membership in 
other scales, i.e., a single classificatory system is employed for 
items in R-technique and for subjects in Q. 

The remainder of this paper will be concerned with definitions of 
terms, a logical exposition of the objective criteria proposed for 
linking items and for testing the scale hypothesis, proofs that the 
method will converge to any level of homogeneity desired, a more 
detailed presentation of the MSA algorithm, the assumptions under- 
lying the model, examples of analyses, and some critical observa- 
tions on the advantages and limitations of the proposed method. 


The MSA Criteria 


If one examines a perfect Guttman scale (see Table 1) and some
measures based upon it, several important properties of this matrix
of items and subjects are evident. The first class of properties of
this matrix will be exploited for bringing items into the scale matrix,
i.e., to provide the basis for the linking criteria, while the second
class will provide us with a statistical concept of reproducibility.


TABLE 1
A Perfect Guttman Scale

              ITEM
  S      1  2  3  4  5    Score
  1      1  1  1  1  1      5
  2      1  1  1  1  0      4
  3      1  1  1  0  0      3
  4      1  1  0  0  0      2
  5      1  0  0  0  0      1
  6      0  0  0  0  0      0
 Sum     5  4  3  2  1


The Linking Criteria 


First, the matrix of phi-coefficients for the inter-item relationships
will form a positive manifold, i.e., all of the items will be
positively related to each other. Second, the marginal frequencies
will be ordered from high to low, i.e., monotonicity of item marginals
will prevail. Third, the distance between adjacent and distinct
sets will be minimal, as measured by the symmetric set difference
(Restle, 1959), i.e., adjacent items will be more like one another
than they will be like more remote items. Fourth, errors (01's)
will be minimal for given distances. Although several additional
properties of such scales could be listed, e.g., the super-diagonal
form of the matrix of phi-coefficients or the tri-diagonal form of
the inverse of the correlation matrix, these attributes are not
directly germane to establishing the criteria used in MSA.

The above four properties of a perfect Guttman scale provide the 
basis for bringing items together (linking) in the MSA technique. 
They are intended to substitute for the ranking methods used in
scalogram analysis. The symbolic statement of these criteria receives
meaning from the cell and marginal designations of Table 2.


(1) $\dfrac{A}{N} \ge \dfrac{(A + B)(A + D)}{N^2}$, or $AC \ge BD$, the
    criterion of positive manifold;
(2) $A + B \ge A + D$, or $B \ge D$, the criterion of monotonicity;
(3) $(B + D)_{\min}$, the criterion of minimum distances between items
    or sets; and
(4) $D_{\min}$, the criterion of minimum errors within equal distances,
    i.e., ties in (3) are broken on the basis of the D or error cell.

The third and fourth criteria used lexicographically can be shown
to be equivalent to minimizing: $[(B + D)^2 + D]$.
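
The four linking criteria can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the names `Fourfold`, `fourfold`, `positive_manifold`, `monotone`, `link_key`, and `link_score` are hypothetical, and the cell layout follows Table 2 (A: both items 1; B: item i = 1, item j = 0; C: both 0; D: item i = 0, item j = 1).

```python
from typing import NamedTuple, Sequence


class Fourfold(NamedTuple):
    A: int  # item i = 1, item j = 1
    B: int  # item i = 1, item j = 0
    C: int  # item i = 0, item j = 0
    D: int  # item i = 0, item j = 1 (the error cell)


def fourfold(item_i: Sequence[int], item_j: Sequence[int]) -> Fourfold:
    """Count the four cells of Table 2 for two 0/1 response vectors."""
    A = B = C = D = 0
    for x, y in zip(item_i, item_j):
        if x == 1 and y == 1:
            A += 1
        elif x == 1 and y == 0:
            B += 1
        elif x == 0 and y == 0:
            C += 1
        else:
            D += 1
    return Fourfold(A, B, C, D)


def positive_manifold(t: Fourfold) -> bool:
    """Criterion (1): AC >= BD."""
    return t.A * t.C >= t.B * t.D


def monotone(t: Fourfold) -> bool:
    """Criterion (2): B >= D, i.e. item i is endorsed at least as often."""
    return t.B >= t.D


def link_key(t: Fourfold) -> tuple:
    """Criteria (3) and (4) used lexicographically: distance, then errors."""
    return (t.B + t.D, t.D)


def link_score(t: Fourfold) -> int:
    """Single-number form of the same ordering: (B + D)^2 + D."""
    return (t.B + t.D) ** 2 + t.D
```

Because D can never exceed B + D, ranking candidate items by `link_score` gives the same order as ranking them by `link_key`; the tuple form is simply easier to read.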


The foregoing criteria are necessary but not sufficient conditions
for insuring a maximally homogeneous scale. A fifth criterion is
required to specify the limits within which one will accept the
scale hypothesis and to guard against the possibility of producing a
heterogeneous scale by paying attention to adjacent items only.


The Reproducibility Criterion

In this section we will develop a measure of homogeneity or
scalability which will completely specify the bounds of
interrelationship existing among all items of a scale. As a background
and transition to this development, however, we will first explicate
the concept of reproducibility as advanced by Guttman and point out
some of its shortcomings.

Guttman has suggested that the coefficient of reproducibility
(REP) be .90 or above as one of the chief criteria for accepting the
scale hypothesis, where REP “. . . is secured by counting up the
number of responses which would have been predicted wrongly for
each person on the basis of his scale score, dividing these errors by
the total number of responses and subtracting the resulting fraction
from 1” (in Stouffer, et al., 1950, p. 77), i.e.:


(5) $\mathrm{REP} = 1 - \dfrac{\text{Sum of Errors}}{\text{Total Responses}}$.


The difficulty involved in this formulation is that the manner of
counting errors is not made explicit in the measure. Suchman (in
Stouffer, et al., 1950), in discussing the scalogram procedure,
suggested the use of cut-off points for determining the error count.
The method of ranking both respondents and items, however, because of
possible tied ranks, may result in varying, albeit small,
discrepancies from judge to judge as to the amount of error involved
in any given scale.


TABLE 2
A Contingency Table

                  Item j
                 0      1
 Item i    1     B      A      A + B
           0     C      D      C + D

Goodenough (1944) has recommended a double-counting procedure
for errors (counting 1’s that should have been 0’s and vice versa),
which has been re-formulated by Lingoes (1960b) as a set-measure.
This conceptualization has the advantage of clearly specifying the 
error count without relying upon ranked data and, more impor- 
tantly, it permits a generalization of reproducibility in terms of 
correlation. The proposed measure is: 


(6) $\mathrm{REP} = 1 - \dfrac{\sum_i \sum_j |O_{ij} - G_{ij}|}{nm}$,

where:

n = the number of subjects;
m = the number of items;
O = the n X m observed binary matrix of subjects and items;
G = the n X m error-free Guttman matrix, which has been
    matched with O on the basis of the n row marginals (subject
    scores), where “error-free” is defined as a matrix having
    the following three properties: a) there are never more than
    two runs or sequences of 1’s and 0’s in any row; b) if there
    are two runs in any row, the ordering of 1’s and 0’s is
    consistent for all rows with two runs; and c) O and G are
    monotonically consistent;
i = the rows (subject responses across items) of O and G; and
j = the columns (item responses across subjects) of O and G.


The numerator of (6) is actually the sum of the distances between
corresponding rows of O and G over all subjects. This measure
of reproducibility is mathematically equivalent to the common
elements formula of the product-moment correlation coefficient, as
applied to two binary matrices (cell-for-cell matching). This
equivalence can be easily demonstrated by noting that the number
of common elements plus the sum of the distances is equal to the
total number of responses.
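
The set-measure form of REP can be sketched in Python as follows. This is an illustration rather than the author's program: the helper names are hypothetical, and the construction of G is an assumption (each subject's 1's are assigned to the most frequently endorsed items, with ties among item sums broken by column position).

```python
from typing import List

Matrix = List[List[int]]


def guttman_matrix(O: Matrix) -> Matrix:
    """Error-free matrix G matched to O on the row marginals.

    Assumption: items are ordered by descending column sum, and a subject
    with score s receives 1's on the s most-endorsed items.
    """
    n, m = len(O), len(O[0])
    col_sums = [sum(row[j] for row in O) for j in range(m)]
    order = sorted(range(m), key=lambda j: -col_sums[j])  # stable for ties
    G = [[0] * m for _ in range(n)]
    for r, row in enumerate(O):
        for j in order[: sum(row)]:
            G[r][j] = 1
    return G


def rep(O: Matrix) -> float:
    """Equation (6): 1 minus the summed cell-by-cell distance over n*m."""
    G = guttman_matrix(O)
    n, m = len(O), len(O[0])
    distance = sum(abs(o - g) for ro, rg in zip(O, G) for o, g in zip(ro, rg))
    return 1 - distance / (n * m)
```

Each response pattern that departs from its Guttman pattern is counted twice here (a 1 that should be 0 and a 0 that should be 1), which is the double-counting convention described in the text.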




Rather than use (6) on O directly, since one or more items may
have disproportionate amounts of error and yet REP be .90 or
more, it is necessary that the fifth criterion be developed on all
item pairs in the scale matrix. In the special case where m = 2,
the fourfold table relating any two items of O, say items i and j,
as in Table 2, is sufficient to calculate REP, i.e., (6) reduces to:


(7) $\mathrm{REP} = 1 - \dfrac{D}{N}$.

By requiring $D \le \epsilon N$, i.e., some constant proportion of N, for
every pair of items, i and j in O, there can be at most m − 1
independent sources of error, since the first item of a scale is by
definition error-free. We can thus calculate the lower bound of REP,
i.e., α, for a matrix of m items by the following formula:


(8) $\alpha = 1 - \dfrac{2(m - 1)\epsilon}{m}$.


As $m \to \infty$, $\alpha \to 1 - 2\epsilon$, and, therefore, we can specify to
whatever degree desired (by making ε sufficiently small) the limits
within which we will accept the scale hypothesis. For example, setting
ε = .10 would insure that every pair of items, i and j in O, would be
correlated (by common elements) with the corresponding items
in G to the extent of at least .90 and, furthermore, the entire matrix O
would be correlated with G at a level of .80 or more. To guarantee
a REP of .90 for O, it would be necessary to set the upper bound
of D at .05N. Empirical evidence would indicate, however, that
such a criterion is unnecessarily restrictive and that quite respectable
REPs can be obtained with the criterion of .10N. The REP calculated
by (6), it might be mentioned, is a more conservative measure,
as pointed out by Edwards (1957), than is typically obtained by
the ranking method of counting errors.

Although one could use $D \le .10N$ as the fifth criterion, a more
general model is suggested based upon the phi-coefficient. Indeed, it
will be shown that REP and φ are related in a rather simple manner
under special conditions. Let us first develop the formula for φ for
two items from O and the corresponding pair from G, where the pair
of items i and j in O are, in effect, treated as one item and the pair
from G is treated as the other.


The formula for the product-moment correlation for dichotomous
items (following the notation of Table 2) is given by:


(9) $\phi_{ij} = \dfrac{AC - BD}{\sqrt{(A + B)(C + D)(B + C)(A + D)}}$.

In the special case where the marginals are equal ($p_i = p_j$),
(9) simplifies to:

(10) $\phi_{ij} = \dfrac{AC - D^2}{(A + B)(C + D)}$,

which is the scheme of the formula wanted for pairs of items from
both O and G, since they are matched matrices, i.e., the proportions
of 1’s and 0’s in O equal those for G.

The cells and marginals for the fourfold table relating two items
from O and two from G by cell-for-cell matching can be calculated
from the table relating items i and j in O as follows:

(11) $B_{og} = D_{og} = D_{ij}$,
(12) $(A + B)_{og} = (A + D)_{og} = (A + B)_{ij} + (A + D)_{ij}$,
(13) $(C + D)_{og} = (B + C)_{og} = (C + D)_{ij} + (B + C)_{ij}$,
(14) $N_{og} = 2N_{ij}$,
(15) $A_{og} = (A + B)_{og} - D_{ij}$, and
(16) $C_{og} = (C + D)_{og} - D_{ij}$.


Now substituting the known quantities relating items i and j of
(11) through (16) into (10) and replacing $\phi_{ij}$ by $\phi_{og}$ we have:

(17) $\phi_{og} = \dfrac{[(A + B)_{ij} + (A + D)_{ij} - D_{ij}]\,[(C + D)_{ij} + (B + C)_{ij} - D_{ij}] - D_{ij}^2}{[(A + B)_{ij} + (A + D)_{ij}]\,[(C + D)_{ij} + (B + C)_{ij}]}$.

Dropping the subscripts ij, which are implied in the following
equations, and simplifying the bracketed terms, (17) becomes:


(18) $\phi_{og} = \dfrac{(2A + B)(2C + B) - D^2}{(2A + B + D)(2C + B + D)}$.


A little algebra will show that (18) reduces to:


(19) $\phi_{og} = 1 - \dfrac{2ND}{(2A + B + D)(2C + B + D)}$.
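
The algebra connecting (18) and (19) can be written out in a few lines. The steps below are a reader's reconstruction, using only the cell notation of Table 2 and the identity $N = A + B + C + D$:

```latex
\begin{align*}
(2A+B+D)(2C+B+D) &= (2A+B)(2C+B) + D\,[(2A+B) + (2C+B)] + D^{2}, \\
(2A+B+D)(2C+B+D) - \bigl[(2A+B)(2C+B) - D^{2}\bigr]
  &= 2D\,(A+B+C) + 2D^{2} \\
  &= 2D\,(A+B+C+D) = 2ND, \\
\phi_{og} = \frac{(2A+B)(2C+B) - D^{2}}{(2A+B+D)(2C+B+D)}
  &= 1 - \frac{2ND}{(2A+B+D)(2C+B+D)}.
\end{align*}
```

The second line shows that the denominator of (18) exceeds its numerator by exactly $2ND$, which is all that is needed to pass from (18) to (19).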


As both of the terms in the denominator of (19) approach N,
$\phi_{og}$ approaches a maximum for a given value of D and as D approaches
N/2 (its maximum, since $B \ge D$, the criterion of monotonicity),
$\phi_{og}$ approaches zero. Thus, the relationship between pairs of items
from O and from G varies between 0 and 1, as does REP. When each
of the denominator terms of (19) equals N, i.e., when A = C, (19)
reduces to:

(20) $\phi_{og} = 1 - \dfrac{2D}{N}$.

Solving for D in both (7) and (20) we can thus state the relationship
between REP and $\phi_{og}$ in the two item case as:

(21) $\phi_{og} = 2(\mathrm{REP} - .5)$,

when the sum of positive endorsements (or negative) for items
i and j = N.

Formula (19) suggests a more sensitive criterion than $D \le .10N$,
inasmuch as for a given level of $\phi_{og}$, the number of errors would
approach zero as one of the marginals (A + D) or (C + D) of O
approached unity. We could thus take a fixed lower limit of $\phi_{og}$ as
a criterion, which would yield varying values of REP as a function
of the item marginals. Solving for D in (19), we would obtain:


(22) $D \le \dfrac{(1 - \phi_{og})(2A + B + D)(2C + B + D)}{2N}$,


for any two items of O. If we set our criterion for $\phi_{og}$ = .80, then
(22) would become:


(23) $D \le \dfrac{.10(2A + B + D)(2C + B + D)}{N}$,


for all item pairs in O.
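
A small numerical check of (19) through (23) may help. The Python sketch below is illustrative only; the function names are hypothetical. It uses the identities 2A + B + D = (A + B) + (A + D) and 2C + B + D = 2N − (A + B) − (A + D), so the error bound can be computed from the two item sums alone.

```python
import math


def phi(A: int, B: int, C: int, D: int) -> float:
    """Equation (9): phi-coefficient of a fourfold table."""
    return (A * C - B * D) / math.sqrt((A + B) * (C + D) * (B + C) * (A + D))


def phi_og(A: int, B: int, C: int, D: int) -> float:
    """Equation (19): phi between a pair of items in O and its pair in G."""
    N = A + B + C + D
    return 1 - 2 * N * D / ((2 * A + B + D) * (2 * C + B + D))


def max_errors(sum_i: int, sum_j: int, n: int, phi_min: float = 0.80) -> float:
    """Equation (22): largest D compatible with phi_og >= phi_min.

    sum_i and sum_j are the numbers of positive responses to the two items.
    """
    ones = sum_i + sum_j          # equals 2A + B + D
    zeros = 2 * n - ones          # equals 2C + B + D
    return (1 - phi_min) * ones * zeros / (2 * n)
```

For a table with A = 40, B = 20, C = 30, D = 10, `phi_og` agrees with the ordinary phi-coefficient computed on the stacked table of equations (11) through (16), i.e., `phi(2*A + B, D, 2*C + B, D)`.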

The lower bound of REP, α, now becomes a function of m, D,
and the marginals of items 1, ···, m, approaching .80 as a limit
as $m \to \infty$, $D \to .10N$, and the item proportions $p_1, \cdots, p_m \to .5$.
Equation (23) thus permits one to calculate the lower bound of
REP for any given set of marginal values and a fixed N by inserting
the proper D values in the following expression:


(24) $\alpha = 1 - \dfrac{2\sum_i (\Delta_i)}{Nm}$, (i = 1, ···, m − 1),


where: $\Delta_i$ = the D value of the ith pair of adjacent items in O.

Thus, for example, if one had nine items with positive response
proportions of: .9, .8, .7, .6, .5, .4, .3, .2, and .1 and N = 100, the
m − 1 $\Delta_i$’s would be: 5.1, 7.5, 9.1, 9.9, 9.9, 9.1, 7.5, and 5.1. These
D values when rounded and summed would equal 64, which would
have to be doubled since:


(25) $\sum_{i=1}^{m-1} (\Delta_i) = .5\left[\sum_{i=1}^{n} \sum_{j=1}^{m} |O_{ij} - G_{ij}|\right]$,


when errors are independent. Substituting the obtained values of 
errors in (24) we have: 


(26) $\alpha = 1 - \dfrac{2 \times 64}{100 \times 9} = .858$,


which contrasts with the value obtained by applying the uniform
criterion of $D \le .10N$, i.e., α = .822, approximately.

To conclude this topic on the fifth criterion of reproducibility, we
have: a) provided a statistical measure which is sensitive to
differences in item marginals, i.e., (19), b) shown the functional
relationships between the set-measure implicit in Goodenough’s method
of counting errors, the common elements correlation coefficient, and
the phi-coefficient, c) given equations, i.e., (8) and (24), for
calculating the lower bound for REP for a given set of items and a fixed
N, and, as a consequence, completely specified the limits of
homogeneity for all item pairs in the scale, and finally, d) demonstrated
that the systematic application of the here proposed fifth criterion,
i.e., (22), will converge to whatever level of homogeneity desired
for any set of items by making $\phi_{og}$ sufficiently large and inserting
the calculated D’s in (24).

Before detailing the MSA algorithm, a tentative sixth criterion
might be added in the nature of a statistical test. Such a test would
be desirable as a protection against accepting the scale hypothesis
when either N is too small or the marginals are too extreme.

The Statistical Criterion 


The problem of selecting the most appropriate statistical test is a
difficult one and depends to a considerable extent on what assumptions
you wish to make and equally importantly on what questions
you are asking (Goodman, 1959; Sagi, 1959). The entire question is
further complicated by the fact that “the conditions under which
tests of significance apply define prior admissible scaling operations. 
Conversely, different scaling operations usually imply different 
sampling distributions of the evaluative statistic. The implication 
is immediately apparent: either scaling operations are restricted to 
those operations consistent with developed statistical tests, or exten- 
sive replication is used as a substitute testing procedure” (Sagi, 
1959, p. 26). Integral to the scaling operations proposed, we would 
like an answer to the following question: Given the observed item 
marginals of the m-1 adjacent item pairs for each MSA scale, what, 
under the assumption of independence, are the probabilities of ob- 
taining the observed D values? 

A special computer program was written to match the item 
marginals in several empirical matrices, randomly assigning 1’s and 
0’s, and 1,000 samplings were made for each pair of adjacent items. 
The distribution of errors was found to be hypergeometric for 
which Fisher's exact test is appropriate. The computation of exact 
probabilities, even with computers, is a slow process when N is 
large. Consequently, the chi square approximation was settled upon 
at a level of .001 as the sixth criterion, since this test is sensitive to 
both sample size and marginal values. 
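
The sampling experiment just described can be imitated in Python. The sketch below is not the original program; its function names, the seed, and the use of the uncorrected chi square (no continuity correction) are assumptions made for illustration. With both item marginals fixed, the count in the D cell follows the hypergeometric distribution given by `hypergeom_pmf`.

```python
import math
import random
from typing import Dict


def hypergeom_pmf(d: int, n: int, sum_i: int, sum_j: int) -> float:
    """P(D = d) when sum_j ones are placed at random among n subjects,
    of whom n - sum_i have a 0 on item i."""
    zeros_i = n - sum_i
    if d < 0 or d > zeros_i or sum_j - d < 0 or sum_j - d > sum_i:
        return 0.0
    return math.comb(zeros_i, d) * math.comb(sum_i, sum_j - d) / math.comb(n, sum_j)


def simulate_D(n: int, sum_i: int, sum_j: int,
               trials: int = 1000, seed: int = 0) -> Dict[int, int]:
    """Shuffle item j against a fixed item i and tally the D cell."""
    rng = random.Random(seed)
    item_i = [1] * sum_i + [0] * (n - sum_i)
    item_j = [1] * sum_j + [0] * (n - sum_j)
    counts: Dict[int, int] = {}
    for _ in range(trials):
        rng.shuffle(item_j)
        d = sum(1 for x, y in zip(item_i, item_j) if x == 0 and y == 1)
        counts[d] = counts.get(d, 0) + 1
    return counts


def chi_square(A: int, B: int, C: int, D: int) -> float:
    """Uncorrected chi square (1 d.f.) for a fourfold table."""
    n = A + B + C + D
    denom = (A + B) * (C + D) * (B + C) * (A + D)
    return n * (A * C - B * D) ** 2 / denom if denom else 0.0
```

Comparing the tallies from `simulate_D` with `hypergeom_pmf` for the same marginals gives a rough check on the hypergeometric form, and `chi_square` can then be compared with 10.83, the value used for the .001 level.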


The MSA Algorithm²


Although the MSA method is rather simple computationally, a
moderately sized sample of subjects and items, say 100 of each,
could be both expensive and time-consuming. Programmed for a
high-speed computer, however, time and expense are small indeed
(Lingoes, 1962a). The steps outlined below, with their associated
comments, constitute the MSA algorithm. 


² The method as reported in this article differs from that of the basic work
(Lingoes, 1960b; 1961) in the following fundamental respects: a) (23) was
substituted for $D \le .10N$; b) the criterion of a positive relationship was
changed from $A + C \ge N/2$ to $AC \ge BD$; c) previously ties were broken
by choosing that item having marginal values closest to preceding items, which
stands in contrast with the present fourth criterion; d) the second criterion of
monotonicity was not insisted upon in the earlier version; e) unanimous items
were excluded in the present model but not in the original; f) three or more
items formerly were required to define a scale, which has been changed to two
or more; and g) an additional, ancillary statistical test has now been
incorporated in the process of scale formation, which was done after the fact in
the earlier version.


STEP 1—Matrix Preparation: Having selected the items to be
analyzed, set the responses out in matrix form, where the rows rep- 
resent subjects (1, . . ., n) and the columns, items (1, . . ., m). 
There are no restrictions on the specificity or generality of the set of 
items. They must, however, be in dichotomous form. “Trues,” 
“passes,” “yeses,” etc., are generally denoted by a “1,” while their 
opposites are represented by a “0.” 

STEP 2—Matrix Reflection: Calculate the number of positive 
responses (Sum) for each item, eliminating unanimous items and 
reflecting (ones-complementing) all items whose marginals are less 
than n/2. The marginal sums of all reflected items simply become 
n — Sum. 

STEP 3—Scale Initialization: Choose that item with the largest 
marginal Sum, breaking ties arbitrarily, as the initial item of any 
scale, by-passing items that have previously entered scales or have 
remained unclassified by virtue of failing either the reproducibility 
or statistical criterion. If all items have been scaled or accounted 
for, the analysis is completed, otherwise proceed.

Given a set of items which form a perfect unidimensional scale,
the largest set of items will be included in a scale when the scale is
started with items from either of the two extremes (see Table 1).
For example, if one started with the (m/2)th (middle) item, this
would bring in the (m/2 + 1)th item, which would, in turn, link with
the (m/2 + 2)th item, etc., until the mth item entered the scale.
Linking this last item with any of the remaining items would either
break the monotonic decreasing series of item marginals or would
introduce error. On the other hand, starting with either the first or
last item would perfectly reconstruct the scale, one the mirror image
of the other. Based on this logic, extreme starting points will yield
larger, fewer, and more reproducible scales than any other point. An
investigation of several empirical matrices, starting at every
possible point (with reflection and without), has confirmed the
soundness of this heuristic.


STEP 4—Item Chaining: After forming the 2 X 2 tables whose
members are the last item that successfully entered the scale and
each of the remaining items in turn, calculate: B + D for all pairings
where $AC \ge BD$, skipping over any items where B < D, or if AC <
BD, calculate A + C and by-pass pairings where A < C. On the
basis of these calculations, choose that item as the candidate for the
scale which has the smallest distance with its immediate predecessor
(breaking ties on the basis of the D or C cell, respectively, for the
two cases). If no candidate can be found, return to STEP 3,
otherwise continue.

STEP 5—Scale Termination: a) If any


$D > \dfrac{.10(2A + B + D)(2C + B + D)}{N}$


for the 2 X 2 tables formed from the candidate item and all of its
predecessors in the scale (making the necessary substitutions in the
case of a negative relationship), terminate the scale and return to
STEP 3; otherwise make the following statistical test.

b) If $\chi^2 < 10.83$ (representing a probability at or less than .001),
between the candidate item and its immediate predecessor,
terminate the scale and go back to STEP 3; otherwise proceed.

c) If AC < BD, reflect the candidate item and return to STEP 4;
otherwise go to STEP 4 directly.

The above five steps represent a complete MSA. If one desires 
to calculate REPs for the matrix of scales, for the individual scales, 
items, or subjects, one can do so at this point. The logic of the MSA 
technique, however, in no way depends upon their calculation. The 
computer program (Lingoes, 1962a) prints out all but the item 
REPs. In addition, the program provides an option for subdividing 
the sample of subjects on the basis of an error analysis, thus making 
possible multiple classification as well as single. The intercorrela- 
tion of scale scores can be used for testing the independence of the 
derived MSA scales. 

As can be seen from the above description, MSA is a stepwise 
procedure for obtaining homogeneous classes of variables, which 
differs in both philosophy and method from either Guttman's scal- 
ogram technique or the exhaustive, combinatorial method of Schutz 
(1961). 

Prior to giving some illustrative examples of MSA scales, a brief
excursion will be made into some of the underlying assumptions of 
the model and how it compares with Guttman’s method. 


The MSA Model 


The basic approach of MSA can be considered to be typo-dimensional
(after McQuitty, 1955). Items are analyzed in such a way


(dimensionalized) as to discover the basic alignments or groupings 
of subjects (typology) in respect to the items entering a scale. Of 
course, the converse is true for Q-analysis. The technique was 
derived in connection with an analysis of senatorial voting be- 
havior (Lingoes, 1960b), but was found to have wider generality. 
For instance, one could consider individuals lying at the extremes 
of a dimension defined by MMPI items as having the same logical
status as oppositely voting senators on a particular set of issues,
e.g., farm related. That is to say, in both instances, that involving
senators and that of a mixed sample of patients and community
subjects, a latent class (Lazarsfeld, 1950) or a type (McQuitty,
1957) is tentatively defined. For subsets of items in a unidimensional
scale, a hierarchy can be determined, where alignments
change as more and more extreme items or issues are encountered. 
We thus have, in effect, a dimensionalization of types. 

The MSA model is completely deterministic like Guttman’s 
(Torgerson, 1958) and assumes that the items to be analyzed are 
either of the cumulative type or can be made to conform to the 
characteristics of such monotone items. It is further assumed that 
the directional nature of the items is a matter for investigation 
rather than fiat. Like factor analysis, items are brought together 
regardless of their direction, the magnitude of the relationship and 
concordance with the model being the primary considerations. Un- 
like factor analysis, however, the method is not bound to linear 
assumptions about the regressions involved and does not insist upon 
mutually high or clustered relationships among all the members of 
the set of variables defining a dimension. 


MSA and Scalogram Analysis 


Multiple scalogram analysis differs from Guttman’s scalogram 
analysis in several important respects. First, the concept of a “uni- 
verse of content” is not necessary to MSA, as it is to Guttman’s 
method. This issue is by-passed by allowing the data to form what- 
ever relationships are implicit, consistent with the logical and 
statistical requirements of the procedure. Although scales are
constructed independently of any a priori considerations of
meaningfulness, MSA results in scales which have, however, all the
statistical properties of unidimensional scales, but obviates the
possible criticism of having selected or constructed items which have
the greatest


probability of fitting the Guttman model. In other words, a model is 
tested rather than imposed. 

Another important difference between the two methods lies in 
the fact that Guttman uses what might be called a piecemeal
approach, i.e., an experimenter selects a set of items as pertinent to
some “universe,” tests for dimensionality, and if certain criteria 
are met, accepts the universe as being scalable. If he desires to test 
another set of items for scalability, the above procedure is repeated. 
MSA, on the other hand, takes a sample of items and attempts to 
minimize the number of scales for a given set of relationships. 

The above differences between MSA and scalogram analysis can 
be summarized by stating that MSA: a) is empirical rather than 
rational in determining scale membership; b) has the capacity for 
yielding multiple scales when the data demand it, rather than re- 
jecting the scale hypothesis for the set when treated as a whole; 
and c) has a statistical rather than an heuristic decision basis for
both grouping items and for testing the scale hypothesis. 


MSA Examples 


In this section two illustrations of MSAs will be provided cover- 
ing a hypothetical and an empirical matrix. 


A Hypothetical Two-Dimensional Case 


The first example is a hypothetical “1,0” data matrix, whose 
underlying dimensionality is two. The problem posed for the MSA 
method is to recover these two unidimensional scales such that from 
a knowledge of the order of the items and the subjects’ scores all 
item responses could be reproduced without error. 

Given the data matrix appearing in Table 3, a standard scalogram
analysis of these 8 items and 25 subjects resulted in a REP
of .70, which was identical to the minimal marginal reproducibility 
(MMR) of the matrix, the lower bound for REP based upon modal 
item marginals. This latter coefficient is simply calculated by sum- 
ming the values appearing in the row Sum of the reflected score 
matrix and dividing by the product nm (the number of subjects 
times the number of items). MMR represents the reproducibility 
of the matrix using a knowledge of the item proportions only. Thus,
for example, one could reproduce at least 80 per cent of the responses
to the first item in Table 3 by knowing that this item had
an 80/20 split. The verdict of a scalogram analysis in the present
instance was that these eight items do not form a scale, i.e., do not
represent a single universe of content.


TABLE 3
A Hypothetical Two-Space Matrix

[The body of this matrix of 1's and 0's (25 subjects by 8 items) is
illegible in the source.]
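
The minimal marginal reproducibility can be computed directly from the item sums. The Python sketch below is illustrative; the function name `mmr` is hypothetical, and the data in the usage note are an assumed stand-in, not the contents of Table 3.

```python
from typing import List


def mmr(data: List[List[int]]) -> float:
    """Minimal marginal reproducibility: the sum over items of the modal
    response frequency, divided by n * m."""
    n, m = len(data), len(data[0])
    modal_total = 0
    for j in range(m):
        ones = sum(row[j] for row in data)
        modal_total += max(ones, n - ones)   # same as the reflected item sum
    return modal_total / (n * m)
```

As a usage example, a 25-subject matrix formed by crossing two independent five-level scales of four items each has item sums of 20, 15, 10, and 5 on each scale; the modal frequencies are then 20, 15, 15, and 20, and `mmr` returns 140/200 = .70 for such a matrix.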

Rather than either discarding these items as nonscalable or
manipulating them by combining items, there exist other alternatives,
e.g., multiple factor analysis, a nonmetric factor analysis
(Coombs & Kao, 1955), Loevinger's method of homogeneous tests
(1948), etc. Results and a discussion of these alternatives can be
found elsewhere (Lingoes, 1960b; Wilkins & Wrigley, 1961; E
1962) and will, therefore, not receive detailed treatment here.
Suffice it to say that each of these methods has its own associated
problems when presented with a matrix such as that appearing in
Table 3, e.g., yielding more complex solutions. An MSA of this
matrix, however, recovered the two orthogonal unidimensional scales
depicted in Table 4 with a REP = 1.00.


TABLE 4
Two Orthogonal Guttman Scales

             First scale          Second scale
Subject     Items     Score      Items     Score

    1      1 1 1 1      4       1 1 1 1      4
    2      0 0 0 0      0       1 1 1 0      3
    3      1 1 0 0      2       1 0 0 0      1
    4      1 0 0 0      1       1 1 1 1      4
    5      1 1 1 0      3       0 0 0 0      0
    6      0 0 0 0      0       1 1 1 1      4
    7      1 1 0 0      2       1 1 1 0      3
    8      1 1 1 1      4       1 0 0 0      1
    9      1 0 0 0      1       1 0 0 0      1
   10      1 1 1 0      3       1 1 0 0      2
   11      1 1 1 1      4       1 1 0 0      2
   12      1 1 0 0      2       1 1 0 0      2
   13      0 0 0 0      0       0 0 0 0      0
   14      1 1 1 0      3       1 1 1 0      3
   15      1 0 0 0      1       0 0 0 0      0
   16      1 1 0 0      2       1 1 1 1      4
   17      1 1 1 1      4       0 0 0 0      0
   18      0 0 0 0      0       1 1 0 0      2
   19      1 1 1 0      3       1 0 0 0      1
   20      1 0 0 0      1       1 1 1 0      3
   21      1 1 1 1      4       1 1 1 0      3
   22      1 1 1 0      3       1 1 1 1      4
   23      1 1 0 0      2       0 0 0 0      0
   24      1 0 0 0      1       1 1 0 0      2
   25      0 0 0 0      0       1 0 0 0      1

(Item labels are illegible in the source scan.)


Items 3 and 6, in the first scale, and 2 and 5, in the second, were
reflected in the process of achieving these scales. Although the
scales in the present example were orthogonal, such a result is not
in general to be expected. More often than not correlated scales are
obtained empirically, requiring other techniques, such as factor
analysis, to settle the issue of dimensionality (Lingoes, 1962b;
1962d). Lest it be concluded that in this event the MSA procedure
is an unnecessary preliminary, the reader is reminded that a factor
analysis at the item level for a set of scalable items will overestimate
the true dimensionality by producing "difficulty factors." MSA
is a data-reduction method operating on the actual responses (manifest
structure) rather than employing some abstraction from the
data as represented by correlation coefficients and factor loadings.
These are some of the reasons why the qualifier “multiple” rather
than “multidimensional” was used in the name of the technique. If
orthogonal scales exist in the data, the method is adequate to produce
them, but an oblique structure for partially-ordered data is
the more likely result. The method can be characterized as yielding
results which lie somewhere between the raw data (in its fineness of
detail) and factor analysis (with its emphasis on the gross outlines
of structure).


TABLE 5
A Guttman Scale of U. S. Senators, No. 1

Senator

Humphrey, Hubert H. (D Minn.)
Murray, James E. (D Mont.)
Jackson, Henry M. (D Wash.)
Mansfield, Mike (D Mont.)
Kefauver, Estes (D Tenn.)
Symington, Stuart (D Mo.)
Hennings, Thomas C., Jr. (D Mo.)
Kerr, Robert S. (D Okla.)
Clements, Earle C. (D Ky.)
George, Walter F. (D Ga.)
McClellan, John L. (D Ark.)
Smathers, George A. (D Fla.)
Holland, Spessard L. (D Fla.)
Byrd, Harry Flood (D Va.)

Bush, Prescott (R Conn.)
Flanders, Ralph E. (R Vt.)
Knowland, William F. (R Calif.)
Bridges, Styles (R N. H.)
Hickenlooper, Bourke B. (R Iowa)
Dirksen, Everett M. (R Ill.)

(The %-Yes column is illegible in the source scan.)


Voting Behavior 

As an empirical example, two of six MSA scales will be presented 
based upon fifty selected voting issues for 88 Senators of the 83rd 
United States Congress (Lingoes, 1960b). This time, however, the 
analysis will be done in Q-technique since most of us know more 
about senators than about the issues on which they vote. The two 
scales selected for illustration appear in Tables 5 and 6. 


TABLE 6
A Guttman Scale of U. S. Senators, No. 2

Senator

Magnuson, Warren G. (D Wash.)
Morse, Wayne (I Ore.)
Monroney, A. S. Mike (D Okla.)
Hill, Lister (D Ala.)
Sparkman, John J. (D Ala.)

(The %-Yes column is illegible in the source scan.)


The REP for the first scale of 20 senators was .959, while that for 
the second scale of five was .967. The MMRs were .761 and .916, 
respectively. A plausible interpretation of the first scale is that it 
represents the left-right or liberalism-conservatism dimension of 
U. S. Senators. The second scale, however, contained members who 
were at one pole of this dimension only, i.e., liberal democrats. The 
ordering in these two scales appeared reasonable enough, confirm- 
ing that the MSA method is particularly appropriate in the area of 
legislative voting (see, for example, Brown, 1962; Brown & Wrigley, 
1961; Knudsen, 1962; Lingoes, 1960b; 1962b; 1962d). 

Before concluding, it might be noted that the fifty issues form- 
ing the basis for the above analysis were selected from a larger 
set of 128 roll call votes. The means for selecting these fifty items 
was provided by an error analysis. These fifty items were, as a
consequence, those which contained relatively little error, i.e., had
REPs ≥ .90. Such a selection process introduces additional problems
of interpretation, e.g., a Q-analysis on all 128 issues would
show a negative relationship between the poles of this dimension. 
What then do the fifty items have in common to produce a com- 
pletely unreflected scale? Or, why do these same senators scale 
differently when the remaining 78 items are used? There is not 
space to discuss this topic here. It is necessary to point out, how- 
ever, that an R- and a Q-analysis using the MSA method are 
reciprocal only when the underlying dimensionality of the items and 
the people is the same. Reciprocity is palpably true for any one
unidimensional scale, but not necessarily true for multiple uni- 
dimensional scales. A comparison of R- and Q-analysis for sena- 
torial voting issues revealed that the items scale much better than 
the subjects and that the dimensionality is greater in the former 
than in the latter. The only justification for selecting these fifty 
issues as a base in the present illustration was that of not
overburdening the reader with unnecessary complexities.


Discussion 
Although the MSA method seems to be quite versatile and meets
two of the chief criticisms directed against Guttman’s method, i.e.,
a) the method of selecting items, and b) the quasi-statistical criteria
used for scaling, some critical observations are in order.

First, as Restle (1959) has stated elsewhere but is quite appropriate
in the present context, it is quite possible for two or more pure
variables to be related in such a way that they cannot be isolated
on the basis of the formal internal evidence of a particular method
of analysis. In another article (Lingoes, 1960a), the importance of
this sometimes unavoidable limitation had been emphasized relative
to the results of factor analyses and was given the label of the
identity error. The MSA method is not immune from this criticism
since it is possible to bring items together from what appear to be
conceptually distinct domains. A number of examples come to mind.
One could easily find a sample of subjects (e.g., those with low
IQs) in which, for a given set of arithmetic problems and words to
be spelled, each subject could spell no better than he could solve
problems. In some other sample, however, these two abilities would
not correlate very highly. Such arguments as the foregoing may
have been instrumental in Guttman's insistence upon a “universe
of content” as a major criterion of item selection. In any event,
good judgment and external evidence cannot be dispensed with
since any empirical or statistical method of analysis, by its very
[...] formulations about domains by inclu[...]
thought to be different areas. If these items appear in the a priori
[...] the same area of content
(see Stouffer, et al., 1950, for other uses of scale analysis).

Second, the fundamental issue regarding the generality of the
cumulative model for other kinds of data, e.g., Thurstone scale
items, and other models, e.g., Coombs and Kao's (1955) nonmetric
disjunctive-conjunctive and compensatory models, needs to be
systematically explored. It has been found, for example, that MMPI
items do not scale well, probably as a function of the infrequency of
endorsement in the pathological direction for a large number of
items even in patient samples.


Third, although a large amount of psychological data is binary,
much work yet remains to be done in the generalization of the
[...] A beginning in this direction
[...] for resolving a correlation
matrix into multiple simplexes. Another approach is afforded
by the demonstrated relationship between reproducibility and cor- 
relation. It should be quite possible to correlate an observed matrix 
with more than two categories with a model matrix defined similar 
to that for binary data. Yet another avenue of approach being 
tested is use of the MSA method as presently constituted on 
dichotomized continuous variables. The results of the analysis are 
then compared with those issuing from analyses using the full range 
of information, e.g., factor analysis. In the basic study proposing 
MSA (Lingoes, 1960b), it was found that MSA did slightly, although 
not significantly, better than multiple factor analysis in predicting 
voting behavior. This study, however, cannot be considered crucial 
and additional research is needed. 

Fourth, the meager beginnings adumbrated in the present report 
on subdividing samples of subjects or items on the basis of error 
analysis require a more rigorous treatment. To separate the often
confounded sources of error variance is no small task. It might be
mentioned here that reliable error can be partialed out by such a
strategy. For example, in one study (Lingoes, 1960b) it was found
that, over independent sets of issues and over a narrow range of
REPs for 88 senators, the test-retest reliability coefficient for REPs
was .63, which while below acceptable standards is nonetheless ap- 
preciable and suggestive of fruitful results using this approach. 

Fifth, and finally, some comments should be made on the reliability
of the MSA procedure as a function of: a) changes in the scaling
criteria and methods; b) changes in the stimuli or items holding 
subjects constant; and c) changes in the subjects but using the 
same variables. In discussing these topics it is important to note 
whether we are talking about error-free data or the more commonly 
encountered empirical data about which we would like to make some 
generalizations. Proofs are often possible in the former case but not 
the latter. For empirical data it is often necessary to offer experi- 
mental evidence in the place of proofs. This we shall do. 


Same Data, Different Criteria 


Evidence based on both error and error-free data is available on
this point. Using error-free data it has been found that the minimum
criteria of positive relationships, smallest distances, and fewest
errors within equal distances for adjacent item pairs, when used
in conjunction with STEPS 2 and 3, will produce the same scales,
regardless of the differences between the remaining criteria (excluding
the statistical test), if one terminates when any error is introduced.

Error data, however, are much more sensitive to variation in criteria
and procedures. The senate data (Lingoes, 1960b) have been
analyzed using the criteria proposed in this paper and the original 
criteria (see footnote 2). The differences in the two analyses were: a) 
more scales with the present criteria, with fewer items and higher 
reproducibilities; and b) very few scales which had perfect overlap. 
The total number of items which scaled and the over-all reproduci- 
bility were, however, quite close. The really important comparison 
involves the factorial structures of the scales from the two analyses. 
Here it was found that despite the small item overlap, scale by scale, 
the same two basic dimensions of domestic and foreign issues were 
recovered. 

A number of analyses, similar in outcome, have led to the conclusion
that stability should be sought in the relationships among
the scale scores and not in the presence or absence of any particular
item or set of items in a given scale when comparing scales using
the same data. The most stable factor solutions will be obtained
when the average level of difficulty and the average level of
reproducibility are preserved for those scales defining any particular factor,
even though item overlap differs.


Same Subjects, Different Items 


The reader is referred to the cross-validation study by Lingoes 
(1960b) for detailed evidence on this point. In summary, the find- 
ings were that stable results can be achieved using different data 
(from the same domain in this instance, i.e., voting behavior), both 
in terms of individual scales defining a set of issues (e.g., farm,
foreign aid, etc.) and in terms of the factors issuing from an analysis
of scale scores. Here, in contrast to the previous topic, the same
method of analysis was employed.


Same Items, Different Subjects 

In this section two points will be made. First, even though per- 
fect item overlap might not exist between scales derived on different 
subjects, it is highly likely, the more homogeneous the scales and
the larger the differences between adjacent item marginals, that
both scales will cross-validate when used on the sample not used in
derivation. Such results are to be expected and are not peculiar to
MSA scales, but hold for Guttman scales in general. Furthermore, the
stronger and the more significant are the relationships, the more
likely it is that item overlap will exist between scales based upon 
different samples. To require perfect item overlap in two studies of 
scale analysis is tantamount to requiring not only the same factors 
in factor analysis, but identical factor loadings. 

Second, although item overlap between scales falls short of per- 
fection, it is still possible to replicate the factor structures based 
upon the scale scores in different samples. This has been done on 
an MSA of MMPI items with results that compared favorably to 
analyses of empirical scales (Lingoes, 1960a). 

In conclusion, it seems a fair assessment to say that MSA is at 
the least more objective than scalogram analysis and can be used 
wherever the latter is applicable. At the other end of the scale, the 
present method of analysis is an important extension of Guttman’s 
approach, which may prove helpful in increasing understanding and 
prediction using homogeneity principles and an “if . . . then" or 
contingency model, rather than the conventional “if and only if” 
model implicit in much of multivariate research. 


REFERENCES 


Brown, Alicia. “A Multiple Scalogram Analysis of the United Nations.”
Unpublished M.A. thesis, Michigan State University, 1962.

Brown, Alicia and Wrigley, C. F. “A Multiple Scalogram Analysis
of the United Nations.” A paper read at the 1961 Midwestern
Psychological Association Convention, Chicago, Ill.

Coombs, C. H. and Kao, R. C. Nonmetric Factor Analysis. Engineering
Research Bulletin, No. 38, 1955.

Edwards, A. L. Techniques of Attitude Scale Construction. New
York: Appleton-Century-Crofts, 1957.

Festinger, L. “The Treatment of Qualitative Data by ‘Scale Analysis.’”
Psychological Bulletin, XLIV (1947), 149-161.

Goodenough, W. H. “A Technique for Scale Analysis.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, IV (1944), 179-190.

Goodman, L. A. “Simple Statistical Methods for Scalogram Analysis.”
Psychometrika, XXIV (1959), 29-43.

Guttman, L. “A Basis for Scaling Qualitative Data.” American
Sociological Review, IX (1944), 139-150.

Guttman, L. “Relation of Scalogram Analysis to other Techniques.”
In Stouffer, et al., Measurement and Prediction. Princeton, N. J.:
Princeton University Press, 1950 (p. 172-212).

Knudsen, Karel. “A Comparison of Two Methods, Multiple Scalogram
Analysis and Factor Analysis, for Analyzing United Nations
Voting Behavior.” Unpublished M.A. thesis, Michigan
State University, 1962.

Lazarsfeld, P. F. “The Logical and Mathematical Foundation of
Latent Structure Analysis.” In Stouffer, et al., Measurement and
Prediction. Princeton, N. J.: Princeton University Press, 1950
(p. 362-472).

Lingoes, J. C. “MMPI Factors of the Harris and the Wiener
Subscales.” Journal of Consulting Psychology, XXIV (1960),
74-83. (a)

Lingoes, J. C. “Multiple Scalogram Analysis: A Generalization of
Guttman's Scale Analysis.” Unpublished Ph.D. thesis, Michigan
State University, 1960. (b)

Lingoes, J. C. “Multiple Scalogram Analysis for MISTIC, ILLIAC,
SILLIAC, and CYCLONE.” Behavioral Science, VI (1961), 97.

Lingoes, J. C. “Multiple Scalogram Analysis: An IBM 704/709/7090
Program.” Behavioral Science, VII (1962), 126. (a)

Lingoes, J. C. “A Multiple Scalogram Analysis of Three Sets of
[...] Voting Issues.” Michigan Psychologist, XXI (1962),
[...]. (b)

Lingoes, J. C. “Information Processing in Psychological Research.”
Behavioral Science, VII (1962), 412-417. (c)

Lingoes, J. C. “A Multiple Scalogram Analysis of Selected Issues of
the 83rd U. S. Senate.” American Psychologist, XVII (1962),
[...]7. (d)

Loevinger, Jane. “The Technic of Homogeneous Tests Compared
with Some Aspects of ‘Scale Analysis’ and Factor Analysis.”
Psychological Bulletin, XLV (1948), 507-529.

McQuitty, L. L. “A Method of Pattern Analysis for Isolating Typological
and Dimensional Constructs.” Research Bulletin, [...].
San Antonio, Texas: Hdqs. Air Force Personnel and
Training Center, 1955.

McQuitty, L. L. “Elementary Linkage Analysis for Isolating Orthogonal
and Oblique Types and Typal Relevancies.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, XVII (1957), 209-[...].

Restle, F. “A Metric and an Ordering on Sets.” Psychometrika,
XXIV (1959), 207-220.

Sagi, P. C. “A Statistical Test for the Significance of a Coefficient
of Reproducibility.” Psychometrika, XXIV (1959), 19-27.

Schutz, W. C. “BC GUTS-GUTtman Scaling.” SHARE Distribution
No. 1337 (Programmed by: Krasnow, Eleanor S.), 1961.

Stouffer, S. A., Guttman, L., Suchman, E. A., Lazarsfeld, P. F.,
Star, Shirley A., and Clausen, J. A. Measurement and Prediction.
Princeton, N. J.: Princeton University Press, 1950.

Torgerson, W. S. Theory and Methods of Scaling. New York: John
Wiley & Sons, 1958.

Wilkins, D. M. “Factor Analysis and Multiple Scalogram Analysis:
[...] and Empirical [...].” Unpublished Ph.D. thesis,
[...] State University, 1962.

Wilkins, D. M. and Wrigley, C. F. “[...] Empirical Analysis.”
A paper read at the 1961 [...], New York, New York.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 3, 1963


DIMENSIONS OF OCCUPATIONAL PREFERENCE¹


JAMES S. TERWILLIGER 
Educational Testing Service 


This study represents an attempt to isolate some of the dimensions
which underlie the occupational preferences of male college
students. The importance of such preferences for purposes of vocational
guidance and counseling has long been recognized. However,
there have been few attempts to systematically describe these preferences
in the context of a generalized model. Further, there have
been even fewer attempts to demonstrate the applicability of a
multidimensional model to occupational preferences. It is hoped 
that the present study will demonstrate that: a) the multidimen- 
sional model is appropriate for preference data, and b) the dimen- 
sions which result from our analysis are psychologically meaningful. 
A secondary objective will be an attempt to relate more conven- 
tional ability, interest, and value measures to the results obtained 
from our dimensional analysis. 


Method 


A systematic method for obtaining generalized preference judgments
has recently been suggested by Gulliksen and Tucker (1961).
They call their procedure the method of multiple rank orders. The
method was “derived to reduce the number of judgments required
of the subject while obtaining information on each possible paired
comparison.” The method involves combining the stimuli in subsets
so that each stimulus appears in the same subset with each other
stimulus once and only once. The subject is then asked to rank the
stimuli within subsets. The resulting rankings provide all of the
information available from a complete paired comparison method
since all possible stimulus pairs are imbedded in the multiple rank
format. The technique of combining stimuli into subsets as here
described is formally known as a balanced incomplete block design.
Such designs are described in detail by Cochran and Cox (1950).
Several possible designs are available. The design which is used depends
upon the total number of stimuli and the number of stimuli
desired in each subset or block.

¹ [...] suggestions and criticisms during the course of the research. Acknowledgment is
also due to the United States Public Health Service for financial support
during the year in which the study was conceived and executed.

One design chosen for the present study has been termed a “6-31” 
design. This design allows for a total of 31 stimuli. The stimuli are 
grouped into 31 blocks of six so that all possible paired comparisons 
are presented once. The subjects are instructed to rank order the 
stimuli within each block. This design was used in the construction 
of two of our measures. A second design which calls for a total of 
13 stimuli grouped into blocks of four was used in constructing two 
other measures. 
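
The defining property of both designs, that every pair of stimuli shares a block exactly once, can be checked with a short sketch. The cyclic construction from a base block used here is an assumption made for illustration; the text does not say how the blocks of the questionnaires were actually generated, and the function names and base blocks are hypothetical choices.

```python
from collections import Counter
from itertools import combinations


def cyclic_design(base_block, n_stimuli):
    """Develop a base block cyclically (mod n_stimuli) into n_stimuli blocks."""
    return [sorted((x + shift) % n_stimuli for x in base_block)
            for shift in range(n_stimuli)]


def pair_counts(blocks):
    """Count how often each unordered pair of stimuli shares a block."""
    counts = Counter()
    for block in blocks:
        counts.update(combinations(block, 2))  # blocks are sorted, so pairs are canonical
    return counts


def is_balanced(blocks, n_stimuli):
    """True if every possible pair of stimuli occurs in exactly one block."""
    counts = pair_counts(blocks)
    n_pairs = n_stimuli * (n_stimuli - 1) // 2
    return len(counts) == n_pairs and set(counts.values()) == {1}


# Assumed base blocks: 31 stimuli in blocks of six, 13 stimuli in blocks of four.
design_6_31 = cyclic_design([1, 5, 11, 24, 25, 27], 31)
design_4_13 = cyclic_design([0, 1, 3, 9], 13)

print(len(design_6_31), is_balanced(design_6_31, 31))
print(len(design_4_13), is_balanced(design_4_13, 13))
```

In the 6-31 case the subject makes 31 rankings of six stimuli, yet the 31 × 15 = 465 pairs imbedded in those rankings are exactly the 465 pairs of a complete paired comparison schedule.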

The subjects are also asked to give a categorical positive-negative
response to each of the stimuli in a separate section of each
questionnaire. The categorical response is used in order to establish
an absolute standard for each subject. This, of course, is not possible
given only the information available from the multiple rank
orders.


Experimental Questionnaires

The first questionnaire requires preference rankings of occupations.
The stimuli consist of 31 occupational titles which were
selected so as to be representative of the types of occupations normally
considered as careers by college trained males. The questionnaire
was constructed with a 6-31 design in which each stimulus
appears a total of six times and is compared directly with each
other stimulus once. The subjects gave a categorical “like-dislike”
response to each stimulus in addition to ranking them for
preference.²

A second questionnaire requires rankings of occupational prestige.
The questionnaire is identical to the preference questionnaire
described above except for instructions for responding. In this case,
the instructions are to rank the occupations according to the “prestige
or status” which the subject attributes to them. The subjects
also gave a categorical “high-low” prestige response to each
occupation.²

A third questionnaire is called the Goals of Life Questionnaire.
This questionnaire is based upon a 4-13 design in which 13 stimuli
are combined into blocks of four so that all possible paired comparisons
appear once. The stimuli are statements which express
general goals of life, e.g., “Gaining personal immortality in heaven.”
These statements were taken from a similar questionnaire which
has been previously published (Cooperative Test Division, Educational
Testing Service, 1950). The subject is instructed to rank the
goals in terms of their desirability and is also required to make a
categorical “desirable-undesirable” response to each.

A fourth questionnaire, termed the Job Attributes Questionnaire,
is also based upon a 4-13 design. The stimuli are job attributes such
as “authority,” “making important decisions,” and “serving others.”
The instructions are to rank the stimuli according to relative importance
of each in choosing a job. The categorical response section
of the questionnaire requires an “important-unimportant” response
to each stimulus.

² Alternate forms of the questionnaire, differing only in the way in which
stimuli were combined into blocks, were constructed. Each form was administered
to one-half of the total sample and the responses were analyzed separately
for each subgroup. The results indicate no appreciable difference in
responses to the two forms, and so the data were pooled for all subsequent
analyses.


Other Measures


The four questionnaires described thus far were administered in
a single testing session. The Allport-Vernon-Lindzey Study of
Values questionnaire, 1960 edition, was also administered during
the same testing period. This instrument has been used extensively
for research purposes. A thorough description of earlier editions of
the instrument is available elsewhere (Buros, 1959).

Other scores were obtained for the subjects through the cooperation
of the Student Counseling Bureau. These include scores on the
Kuder Preference Record-Occupational, Form D (Kuder, 1956)
and the Cooperative School and College Ability Test (SCAT),
Form A (Cooperative Test Division, Educational Testing Service,
1955). These instruments are administered to all incoming freshmen
at the University as part of a battery of Freshman Guidance
Examinations. Both instruments are well standardized and widely
used for guidance work.


Subjects 


The experimental subjects were male introductory psychology
students at the University of Illinois. The subjects were required to
participate in the experiment as part of the course requirements.
The great majority of the subjects were freshmen and sophomores.
The mean age for the total sample of 280 subjects was 18.7 years.


Results 


The Occupational Preference Questionnaires were scored on an
IBM 650 computer with the aid of a program previously written by
Gulliksen and Tucker (1959). This program provides for both individual
and group analyses.

There are a total of 33 “scores” for each individual. These consist
of:


(1) A “votes against” score for each of the 31 stimuli. This score
is the number of times that the subject indicates a preference for
each of the remaining stimuli over a given stimulus. This score
ranges from 0 to 31. (The range is 0 to 30 if one considers only
the multiple rankings among the 31 occupations. However,
responses to the “Like-Dislike” section of the questionnaire were
also incorporated into the score. This section was treated as if the
items represent paired comparisons between each of the stimuli
and a subjective “neutral point.” This neutral point is an hypothetical
additional stimulus. Therefore, if a subject gives a
“Dislike” response to a stimulus, we assume that the neutral
point is “preferred” to the stimulus and, consequently, another
tally is added to the “votes against” score for that stimulus.
Inclusion of the neutral point as an hypothetical stimulus thus
makes the maximum “votes against” score 31.)

(2) A “votes against” score for the neutral point. This is simply
a tally of the number of “Like” responses given by the subject in
the “Like-Dislike” part of the questionnaire. As with the scores
in (1), the possible range is 0 to 31.

(3) The number of circular triads present in the subject’s responses,
i.e., the number of times the subject ranks A > B, B >
C, and C > A. This score can take on values ranging from 0 to
1360.
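
The three kinds of score listed above can be sketched as follows. This is not the Gulliksen and Tucker program; the function names, the seven-stimulus example, and the use of the standard counting identity for circular triads in a complete set of paired comparisons (total triads minus the transitive ones) are all assumptions made for illustration.

```python
from math import comb


def votes_against(block_rankings, n_stimuli, liked):
    """Return n_stimuli + 1 scores; the last entry is the neutral point.

    block_rankings: one list per block, ordered from most to least preferred.
    liked: set of stimuli given the positive categorical response.
    """
    scores = [0] * (n_stimuli + 1)
    for ranking in block_rankings:
        for position, stimulus in enumerate(ranking):
            scores[stimulus] += position      # stimuli ranked above it in this block
    for stimulus in range(n_stimuli):
        if stimulus in liked:
            scores[n_stimuli] += 1            # stimulus preferred to the neutral point
        else:
            scores[stimulus] += 1             # neutral point preferred to the stimulus
    return scores


def circular_triads(scores):
    """Circular triads implied by a complete set of 'votes against' scores."""
    n = len(scores)
    return comb(n, 3) - sum(comb(s, 2) for s in scores)


def max_circular_triads(n):
    """Largest possible number of circular triads among n stimuli."""
    return (n**3 - 4 * n) // 24 if n % 2 == 0 else (n**3 - n) // 24


# Hypothetical 7-stimulus design in blocks of three; the subject always
# prefers the lower-numbered stimulus and "likes" stimuli 0, 1 and 2.
blocks = [sorted((x + shift) % 7 for x in (0, 1, 3)) for shift in range(7)]
scores = votes_against(blocks, 7, liked={0, 1, 2})
print(scores)                     # [0, 1, 2, 4, 5, 6, 7, 3]
print(circular_triads(scores))    # 0, the responses are fully transitive
print(max_circular_triads(32))    # 1360, for 31 stimuli plus the neutral point
```

The last line reproduces the upper limit of 1360 quoted in (3), which corresponds to 32 objects: the 31 occupations and the neutral point.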


The Occupational Prestige Questionnaires were scored in the same 
fashion. 

Gulliksen and Tucker (1961) have shown that if the number of
circular triads is represented by d, then E[d] = 968.75 and
Var[d] = 3955.73 for a 6-31 design. The mean numbers of circular triads
in the preference and prestige rankings are 116.6 and 157.26,
respectively, both far below the value expected under random
responding. We therefore reject the hypothesis that our data consist
of random responses.
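
The size of the departure from randomness can be gauged with a rough calculation. The sketch below expresses each observed mean in standard-deviation units of the random-response distribution for a single subject; treating the comparison as a simple standardized deviate is an assumption, not a test reported in the text, and the function name is hypothetical.

```python
from math import sqrt

EXPECTED_D = 968.75     # E[d] under random responding, 6-31 design
VARIANCE_D = 3955.73    # Var[d] under random responding, 6-31 design


def standardized_deviate(observed):
    """Distance of an observed triad count from E[d], in standard deviations."""
    return (observed - EXPECTED_D) / sqrt(VARIANCE_D)


z_preference = standardized_deviate(116.6)
z_prestige = standardized_deviate(157.26)
print(round(z_preference, 1), round(z_prestige, 1))
```

Both means lie more than twelve standard deviations below the random expectation, even before the sample size of 280 is taken into account.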

The Goals of Life and Job Attributes Questionnaires were also 
scored on the IBM 650. As before, a “votes against” score for each 
of the 13 stimuli and the neutral point as well as a tally of the num- 
ber of circular triads was obtained for each subject’s responses. In 
the case of the Goals of Life instrument, the scores were combined 
into four “factor scores” based upon previous analyses (Tucker, 
1956). 

The Allport-Vernon Questionnaires were scored to obtain sub- 
scores for each of the six general values which the instrument reput- 
edly taps. 

The Kuder Preference and SCAT scores are in decile form. There
are 10 Kuder scores, one for each of Kuder’s major occupational
areas (Kuder, 1956), and three SCAT scores (Verbal, Quantitative,
and Total).

The primary analysis is the determination of the dimensionality
of the preference judgments. The occupational preference and job 
attribute variables are included in this analysis since interest centers 
on both the specific occupations preferred and the more general at- 
tributes desired in a job. 

The cross-products among the 32 occupational preference and 14
job attribute variables were first obtained. (The circular triads
measures were omitted.) Cross-products are desired since they include
effects due to means whereas such effects are ignored by the
more conventional correlations or covariances. However, the use
of cross-products requires that “corrections” be made because of
the more restricted range of scores for the job attribute variables
(grand mean = 6.5) as compared with the preference variables
(grand mean = 15.5). Since the ratio of the two grand means is approximately
2 to 1, the cross-products involving a preference and
job attribute variable were multiplied by 2 and those involving two
job attribute variables were multiplied by 4. These “corrections” are
admittedly only approximate, but they have the effect of placing all
of the cross-products among the 46 variables on a more equal scale.
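
The “corrections” amount to weighting each job attribute variable by 2 before the cross-products are formed, as the following sketch shows. The function names and the three-variable score matrix are hypothetical and serve only to illustrate the arithmetic.

```python
def cross_products(scores):
    """Matrix of sums of cross-products among the columns of a score matrix."""
    n_vars = len(scores[0])
    return [[sum(row[i] * row[j] for row in scores) for j in range(n_vars)]
            for i in range(n_vars)]


def correct_cross_products(cp, n_preference):
    """Multiply preference-by-attribute entries by 2 and
    attribute-by-attribute entries by 4; leave the rest unchanged."""
    weights = [1 if k < n_preference else 2 for k in range(len(cp))]
    return [[cp[i][j] * weights[i] * weights[j] for j in range(len(cp))]
            for i in range(len(cp))]


# Hypothetical scores: two preference variables followed by one attribute variable.
scores = [[10, 20, 3], [30, 5, 9], [15, 15, 6]]
corrected = correct_cross_products(cross_products(scores), n_preference=2)

# Same result as doubling the attribute scores before forming cross-products.
doubled = [row[:2] + [2 * row[2]] for row in scores]
print(corrected == cross_products(doubled))  # True
```

Seen this way, the correction is a rescaling of the attribute scores to roughly the same grand mean as the preference scores.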

The matrix of “corrected” cross-products was then analyzed. This 
matrix involves cross-products among ipsatized measures and there- 
fore requires the special analytic technique which Tucker (1956) 
has derived for double-centered score matrices. The matrix was 
first submitted to a principal axes analysis. The first seven roots 
were retained as a basis for computing revised estimates for the 
entries in the main diagonal of the matrix. 

The estimated uniqueness for each variable was determined by 
subtracting the revised diagonal entry from the corresponding origi- 
nal diagonal element. These values were used to make corrections for 
uniqueness in the off-diagonal cells of the matrix. This correction 
was made separately on the submatrices of cross-products within 
the occupational preference (32 × 32) and job attribute (14 × 14)
variables, since the scores are “centered” within each set of variables. 

The principal axes analysis was reiterated on the revised matrix. 
The nine largest roots were kept. These were used to compute new 
estimates for the diagonals in the same manner as had been done 
after the first analysis. These new estimates were compared to the 
corresponding estimates based upon the first analysis. The distribu- 
tions of the two sets of values are in close agreement and it was, 
therefore, decided that further iterations would result in only in- 
significant changes. 

The varimax rotation was applied to the nine principal axis factors.
The varimax solution revealed a need for a translation of the
axes with respect to the preference variables. (A translation of the
axes is often required when a double-centered analysis is performed
because the origin has been “shifted” due to the ipsative nature of
the variables.)


JAMES S. TERWILLIGER 531


This translation was achieved by adding a tenth factor with con-
stant (.50) loadings for all of the 32 preference variables and zero
loadings for the 14 attribute variables. This has the effect of adding
a constant to all of the cross-products in the 32 X 32 submatrix of
preference variables. The factor was included with the original nine
factors in a new varimax rotation. The loadings on the 10 resulting
varimax factors are presented in Table 1.
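
The effect of the translation factor on the reproduced cross-products can be checked numerically. The sketch below assumes Python with NumPy; the nine-factor loading matrix is random and purely hypothetical, and only the constant (.50) column follows the description above.

```python
import numpy as np

n_pref, n_attr, n_factors = 32, 14, 9
rng = np.random.default_rng(1)
loadings = rng.normal(scale=0.3, size=(n_pref + n_attr, n_factors))  # hypothetical

# Tenth factor: .50 for the preference variables, zero for the attributes.
translation = np.concatenate([np.full(n_pref, 0.50), np.zeros(n_attr)])
augmented = np.column_stack([loadings, translation])

# Change in the reproduced cross-products caused by the added factor.
change = augmented @ augmented.T - loadings @ loadings.T
print(change[0, 1], change[0, n_pref], change[n_pref, n_pref + 1])
```

Every cell of the 32 X 32 preference submatrix increases by .50 x .50 = .25, while the attribute submatrix and the preference-by-attribute cells are unchanged.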


TABLE 1
Varimax Factor Loadings for Preference and Attribute Variables*

                                       I    II   III    IV     V    VI   VII  VIII    IX     X

Preference variables
 1. clergyman                         05   -69    25    39    16    26    06    03   -29    17
 2. nuclear engineer                  93    24   -11    04    01    07    06   -06    05    02
 3. operational programmer            77  -93?    19   -03   -02   -01   -01    01   -08    04
 4. industrial designer               42    10    28   -02    03    06    54    03   -01   -01
 5. union business agent              04   -26    84    06    04   -02   -08   -03   -05    15
 6. meteorologist                     38   -07    06    12    01    56    08    08   -04   -03
 7. life insurance salesman           01   -22    80    05    04    08   -01    00   -23    02
 8. guidance and control engineer     89    15   -00    02   -02    01    09   -01    07    00
 9. logical design engineer           84   -07    07   -05    02   -01    12    00   -04    01
10. civil engineer                    ..    14    07    ..    ..    16    34   -01   -03   -07
11. bank president                    01    56    70    29   -09    02    14   -17   -01    06
12. high school teacher              -02   -13    20    79    05    20    07    13   -08   -05
13. sociologist                       03   -11    26    27    25    31   -00    12    09    42
14. newspaper editor                 -03    15    58    32   33?    01    02    ..    21    ..
15. curator, art gallery              05   -84    28    ..    64    ..    ..    ..   -21    ..
16. head, construction company        ..    ..    64    ..    ..    ..    ..    ..    ..    ..
17. geologist                         ..    ..    ..    ..    ..    67    ..    ..    ..    ..
18. astrodynamicist                   82   -02   -11   -01    ..    ..    ..    ..    ..    ..
19. architect                         24    34    14    ..    23    ..    62    ..    17    ..
20. novelist                         24?    13    20    ..    84    18   -10    06    01    00
21. college dean                     -03    19    36    65    07    07    01   -08    ..    ..
22. hotel manager                    -08    ..    78    ..    ..    ..    ..    ..    ..    ..
23. anthropologist                    ..    ..    ..    ..    ..    64    ..    ..    ..    22
24. lawyer                            ..    74    ..    40    ..    ..    ..    ..    25    34
25. artist                            01   -38    13    06    88    14    41    06   -08    ..
26. college teacher                   06    17    12    81    10    13    04    ..    ..    ..
27. rocket-test engineer              84    35    01    00   -00    ..  -02?   11?   05?    ..
28. advertising agent                -01    08    76    01    21   -07    21    02    ..    ..
29. surgeon                           16    54    07    38    21    27    08   -13    ..    43
30. solid state physicist             91   -17   -10    01    08    21   -03   -01    ..    ..
31. historian                         ..   -38   28?    ..    ..    ..    ..    ..    ..    23
32. neutral point                     24  -95?    36    15    16    21    01    01   -05    ..

Attribute variables
33. security                          ..    ..    ..    ..   -16    ..    ..    ..   -25    15
34. high prestige                     ..    ..    01   -03   -01   -01   -01   -57    06   -02
35. serving others                    ..   -59    ..    ..    ..    ..    ..    22    ..    15
36. authority                         02   -14    09   -08   -09    01   -00   -33    14   -05
37. creativity                        07    28   -29    12    17   -01    16    29    30   -05
38. recognition                      -01    02    01    04    08   -07   -02   -36    06   -03
39. opportunity for advancement       04   43?   -00    05   -15   -04    13    05   -04   -02
40. working conditions               -02   -10    04    01   -03    00   -06    22   -33    00
41. making important decisions        06    12   -03    00   -13    03    02   -11    31    09
42. high income                      -02    51    12   -04   -01   -06   -01   -26   -08   -03
43. personal interaction             -09   -15   -00    04    07   -06    04    25   10?    15
44. varied activity                   04   -10   -02   -01    08    01   -08    38   -03   -14
45. independence                     -03    21   -03    06    04    04   -05    15    15   -04
46. neutral point                    -01   -81    13   -23    09    08   -14   -03   -32  -13?

* Decimal points have been omitted.
Cells shown as .. are illegible in the source copy; a value followed by ? is
a doubtful reading. Values cited in the Discussion have been entered where
the corresponding table cell is illegible.


The final step in the analysis is to obtain the relationship between
all the variables not included in the factoring and the 10 factors
which were found. This is accomplished directly through the factor
extension technique (Dwyer, 1937). The results are presented in
Table 2.
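
The idea of extending a factor solution to additional variables can be sketched as a least-squares projection. This is an assumption-laden illustration in Python with NumPy: the function name `extend_loadings` is hypothetical, the data are simulated, and the formula shown is one common least-squares reading of factor extension rather than a transcription of Dwyer's original computing routine.

```python
import numpy as np

def extend_loadings(cross_new_old, loadings_old):
    """Least-squares loadings of new variables on already-determined factors.

    cross_new_old : (new variables x old variables) cross-products
    loadings_old  : (old variables x factors) loadings
    """
    gram = loadings_old.T @ loadings_old
    return np.linalg.solve(gram, loadings_old.T @ cross_new_old.T).T

# Simulated check: if the cross-products are exactly reproduced by the
# factors, the extension recovers the new variables' loadings.
rng = np.random.default_rng(2)
loadings_old = rng.normal(size=(46, 10))       # hypothetical factored variables
loadings_new = rng.normal(size=(9, 10))        # hypothetical extension variables
cross_new_old = loadings_new @ loadings_old.T

recovered = extend_loadings(cross_new_old, loadings_old)
print(np.abs(recovered - loadings_new).max())
```

With real data the cross-products are not reproduced exactly, so the extended loadings are only approximations to the relationship between each new variable and the factors.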


TABLE 2
Varimax Factor Loadings for Prestige, Allport-Vernon,
Goals of Life, Kuder, and SCAT Variables*

                                       I    II   III    IV     V    VI   VII  VIII    IX     X

Prestige
 1. clergyman                        -07    26    02    23   -17   -09    08    16    09    06
 2. nuclear engineer                  25    69  -93?    07   -16   -02    09    05    20    04
 3. operational programmer            24   -20   -03   -17   -11   -07    00    03   -08    08
 4. industrial designer               04   -01    01   -10   -12   -03    30    05   -10    04
 5. union business agent             -07   -68    35   -29    03   -06   -13   -06   -23    00
 6. meteorologist                     04   -28   -12   -10   -05    24   -02    02   -08    08
 7. life insurance salesman          -08   -76    33   -28   -01    02   -13   -09   -34   -07
 8. guidance and control engineer     20    24   -16   -03   -15    ..    ..    ..    ..    ..
 9. logical design engineer           25    05   -11   -11   -10    ..    ..    ..    ..    ..
10. civil engineer                    08    06   -05   -00    ..    ..    17   -00   -08   -00
11. bank president                   -14    77    21    17    18   -14    12   -09    14   -11
12. high school teacher              -09   -60    02    19   -04   -00   -12    00   -21   -08
13. sociologist                      -12   -28   -08   -03    17    07   -08    12   -02    22
14. newspaper editor                 -21    15    20    13    12   -15   -10   -05    19   -06
15. curator, art gallery             -07   -83    05   -16    32    04   -12   -02   -22   -16
16. head, construction company       -09   -30    31   -14   -14    02    02   -20   -12   -22
17. geologist                         03   -22   -16   -07    03    28   -05    06   -03   -04
18. astrodynamicist                   25    28   -28   -04   -03    13   -03    04    15    01
19. architect                        -05    38   -08    05   -01   -02    32    00    11    05
20. novelist                         -12    10   -06    04    34    02   -19    04    15    04
21. college dean                     -15    61    00    38   -09   -09    02   -01    22   -06
22. hotel manager                    -17   -50    36   -12    02   -04   -13   -16   -28   -19
23. anthropologist                   -04   -24   -21   -05    18    24   -05    02    01    13
24. lawyer                           -14    88    02    21   -06   -03    07    01    29    09
25. artist                           -03   -26   -07   -10    36   -02   -03    02   -10   -05
26. college teacher                  -07   -06   -08    38   -04   -09   -07    04    03   -08
27. rocket-test engineer              24    53   -18   -01   -16   -01    10    05    14    06
28. advertising agent                -13   -50    32   -23   11?   -12   -04   -11   -12    03
29. surgeon                          -06   108   -11    24   -06   -02    07    04    28    16
30. solid state physicist             25    40   -29    07   -09    05    03    10    18    01
31. historian                        -09   -44   -04    03    19    08   -23   -05   -05    06
32. neutral point                     02   -42    08   -12    04    01   -08   -08   -18   -09

Allport-Vernon
33. Theoretical                       33    03   -25   -13   -03    13   -12    06    09   -10
34. Economic                          05    18    37   -26   -27   -07    07   -16   -06   -18
35. Aesthetic                        -15   -04   -20    05    52    ..    17    ..    ..    ..
36. Social                           -08   -07   -02    09   -04    05   -13    19    05    26
37. Political                        -04    11    21    03   -00   -07   -19   -27    08   -09
38. Religious                         ..   -21   -12    29   -18   -04    20    11   -19    24

Goals of Life
39. Service                          -02   -08   -04    05    05   -00   -07    20    21    06
40. Religious                        -04   -12   -08    19   -23   -01    25    01  -26?    08
41. Power                             09    18    08   -26    19   -02   -13   -30    09   -12
42. Stoicism                         -03    08    07   -08    06    04   -11    05    00  -01?

Kuder
43. Outdoor                           00    02   -00    07   -12    19   -07    10   -05   -29
44. Mechanical                        12    05   -02   -05   -05    00    ..    ..    ..   -24
45. Computational                     14    00    05   -02   -19   -07   -05   -07   -08    ..
46. Scientific                        19    05   -22   -05    07    10   -17    00    05    ..
47. Persuasive                        ..    00    17   -09    10   -14    07    05    10    07
48. Artistic                         -00   -05   -12   -05    22   -07    24   -02    05    02
49. Literary                          ..    02   -02    05    29   -07   -19    02    02    ..
50. Musical                           ..   -07   -10    14   -02   -05    02   -02    ..    ..
51. Social service                  -12?   -05   -05    05   -02    12   -05    00    ..    31
52. Clerical                          02    02    05    07   -17   -00    00   -07   -22    ..

SCAT
53. Verbal                           -00    03  -06?    06    21    06   -15   -03    ..   -03
54. Quantitative                      12    00    00    ..   -12   -03   -06   -03    00  -03?
55. Total                             ..    03   -03    06    ..   -06   -15   -09    08    ..

* Decimal points have been omitted.
Cells shown as .. are illegible in the source copy; a value followed by ? is
a doubtful reading. Values cited in the text have been entered where the
corresponding table cell is illegible.
The signs for the Allport-Vernon, Kuder, and SCAT variables have
been reflected in Table 2 so that the direction of scoring for these
variables is consistent with the scoring of the other variables. Also,
the loadings of the Kuder and SCAT variables have been multiplied
by scaling constants to adjust for the relatively small variances of
these variables. (The Kuder and SCAT measures were decile scores
and consequently highly restricted in variability.) The scaling con-
stants used are the ratios of the average standard deviation for the
prestige rankings to the average standard deviation of each of the
two sets of variables. These ratios are 2.4 and 3.0 for the Kuder and
SCAT, respectively.
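
The arithmetic of the scaling constants can be sketched briefly. The standard deviations and the raw loading below are hypothetical values chosen only so that the ratios come out at 2.4 and 3.0; the actual standard deviations are not reported here. Python with NumPy is assumed.

```python
import numpy as np

def scaling_constant(reference_sds, other_sds):
    """Ratio of the average reference SD to the average SD of another set."""
    return float(np.mean(reference_sds) / np.mean(other_sds))

prestige_sds = [5.5, 6.5]      # hypothetical; average 6.0
kuder_sds = [2.0, 3.0]         # hypothetical; average 2.5
scat_sds = [1.5, 2.5]          # hypothetical; average 2.0

kuder_constant = scaling_constant(prestige_sds, kuder_sds)
scat_constant = scaling_constant(prestige_sds, scat_sds)

raw_kuder_loading = 0.08       # hypothetical unscaled loading
scaled_kuder_loading = raw_kuder_loading * kuder_constant
print(kuder_constant, scat_constant, round(scaled_kuder_loading, 3))
```

Multiplying by the constant places the Kuder and SCAT loadings on roughly the same scale as the prestige loadings, so that their magnitudes in Table 2 can be compared.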
Discussion 

Factor I is clearly defined by preference for scientific-technical 
occupations. Occupations which load highly on this factor include 
nuclear engineer (93), solid state physicist (91), guidance and con- 
trol engineer (89), logical design engineer (84), rocket-test engineer 
(84), astrodynamicist (82), and operational programmer (77). 
There are no appreciable negative loadings for occupations on this 
factor. The job attribute variables have negligible loadings. This 
means that preference for scientific-technical occupations is in-
dependent of preference for the general attributes which were se-
lected for this investigation.

Factor II is defined by both the occupations and attributes. Vari- 
ables with high positive loadings are: lawyer (74), bank president 
(56), surgeon (54), high income (51), and opportunity for advance- 
ment (48). Those with large negative loadings include: curator, art 
gallery (—84), neutral point for attributes (—81), clergyman 
(—69), serving others (—59), historian (—38), and artist (—38). 
This factor seems to contrast preference for occupations which yield 
material rewards for personal success with those which do not. The 
factor is interpreted as preference for high income. 

There is also some justification for calling this factor a generalized 
preference factor. This is suggested by the fact that the loadings on 
this factor order the variables according to their means. This ex- 
plains why the neutral point for attributes has such a large negative 
loading, i.e., the neutral point had the lowest mean score for this set 
of variables since all of the attributes were considered generally 
desirable by the subjects. Even if this factor is considered to be a
general preference factor, it is the writer's feeling that psychological
significance can be attached to the ordering of the means. Therefore,
the interpretation which has been discussed above is preferred.


3 A more detailed description of the data is available in the complete dis-
sertation. Copies of the dissertation are available through University Micro-
films, Inc., 313 North First Street, Ann Arbor, Michigan.

Factor III represents preference for persuasive occupations. Oc- 
cupations with high loadings include: union business agent (84), 
life insurance salesman (80), hotel manager (78), advertising agent 
(76), bank president (70), head, construction company (64), and 
newspaper editor (58). There is some suggestion that preference for
persuasive occupations implies non-preference for scientific-techni-
cal occupations since only occupations with high loadings on Factor
I have negative loadings on Factor III. The only job attribute
variable which loads appreciably on Factor III is
creativity (−29).

Factor IV is preference for occupations which involve teaching. 
College teacher (81), high school teacher (79), and college dean 
(65) clearly define this factor. Other occupations with large loadings 
are: lawyer (40), clergyman (39), and surgeon (38). Each of these
occupations involves elements of teaching, especially those aspects
usually associated with classroom lecturing. The loadings for the at-
tribute variables are very low and insignificant.

The fifth factor is defined by preference for: artist (88), novelist
(84), and curator, art gallery (64). This is obviously preference for
artistic occupations. It is interesting to note that industrial designer 
(03) and architect (23), two occupations generally regarded as 
closely allied to the arts, do not load heavily on this factor. The 
explanation for this will appear below. 

None of the attribute measures load highly on Factor V but the
pattern of loadings is generally consistent with the stereotype con-
ception of the artist. There is some concern for creativity (17) and
little concern for security (−16), advancement (−15), or making
important decisions (−13).

Factor VI is interpreted as preference for outdoor-scientific oc-
cupations. Occupations with high loadings are: geologist (67), an-
thropologist (64), and meteorologist (56). There are no appreciable
loadings on this factor among the attribute measures.

Factor VII is defined by high loadings on architect (62), in-
dustrial designer (54), artist (41), and civil engineer (34). This
helps to explain why architect and industrial designer did not have
higher loadings on Factor V. It appears that these two occupations
involve elements which cause them to be perceived differently from
the other occupations in the artistic group. The most obvious dif-
ference between these two and the remaining artistic occupations
would appear to be in the technical skills and training required for
the occupations. This interpretation is reinforced by the fact that
civil engineer loads rather highly on this factor while novelist has a
loading of −10. This factor is therefore interpreted as preference
for artistic-technical occupations.

Although the job attribute loadings are generally low on Factor 
VII, it is interesting to contrast their pattern with that on Factor V. 
Creativity loads positively on both Factor V and VII. However, 
security and opportunity for advancement load negatively on Fac- 
tor V but positively on Factor VII. As pointed out above, the person 
who prefers the purely artistic occupations (Factor V) has little 
concern for security or advancement but this is not the case for the 
person who prefers artistic-technical occupations (Factor VII). 

The attribute variables define Factor VIII. Negative loadings ap- 
pear for high prestige (—57), recognition (—36), authority (—33), 
and high income (—26). Variables with positive loadings are: varied 
activity (38), creativity (29), personal interaction (25), serving 
others (22), and working conditions (22). This is therefore a bi- 
polar factor. The factor is interpreted as contrasting preference for 
attributes associated with the role which a job defines (prestige, 
recognition, authority, and income) as opposed to preference for
the activities involved in a job (varied activity, creativity, serving 
others, and personal interaction). At the same time, preference for 
specific occupations is unrelated to these attribute preferences since 
none of the occupational variables load appreciably on this factor. 

Only tentative interpretations are suggested by Factors IX and X.
The attribute loadings on Factor IX contrast concern for working
conditions (−33) and security (−25) with preference for making
important decisions (31) and creativity (30). Occupations with
negative loadings include: clergyman (−29), life insurance sales-
man (−23), and curator, art gallery (−21). Those with highest
positive loadings are lawyer (25), newspaper editor (21), and
architect (17). This factor is, therefore, bi-polar and appears to
represent a security vs. creativity dimension.

Factor X suggests a dimension defined by interest in others. Vari-
ables with highest loadings are: surgeon (43), sociologist (42), law-
yer (34), historian (23), anthropologist (22), personal interaction
(15), security (15), and serving others (15).


Relation of Remaining Measures to Obtained Factors 


Table 2 presents the loadings of the Prestige Ranking, Allport- 
Vernon (AV), Goals of Life (GL), Kuder (KD), and SCAT vari- 
ables on each of the 10 varimax factors. (The prestige variables have 
been corrected for uniqueness attributable to the fact that the same 
occupational titles were ranked for both preference and prestige.) 

The pattern of loadings for the prestige variables on Factor I is 
much the same as for the corresponding preference variables. The 
major difference is that the general magnitude of the prestige load- 
ings is considerably lower than for the preference variables. This 
means that although this dimension contributes most to the cross- 
product of the preference variables, it contributes relatively little 
to the prestige cross-products. (It ranks sixth in this respect.) Also, 
the dimension is truly bi-polar for the prestige variables. Persuasive 
occupations are sharply contrasted to scientific occupations, with 
the former loading negatively and the latter positively. 
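
Statements about how much each factor contributes to a set of cross-products can be made concrete with a small sketch. It assumes, as one plausible reading, that a factor's contribution is the sum of the squared loadings of the variables on that factor; the loadings are hypothetical and are not taken from Table 2. Python with NumPy is assumed.

```python
import numpy as np

# Hypothetical loadings: 3 variables (rows) on 3 factors (columns).
loadings = np.array([[0.2,  0.7, -0.3],
                     [0.1, -0.6,  0.3],
                     [0.3,  0.8, -0.1]])

# Contribution of each factor: sum of squared loadings in its column.
contribution = (loadings ** 2).sum(axis=0)

# Rank 1 = largest contribution.
rank = contribution.argsort()[::-1].argsort() + 1
print(np.round(contribution, 2), rank)
```

Under this reading, a factor can rank first for one set of variables and much lower for another, as Factor I does for the preference and prestige variables.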

Of the remaining measures, those with highest loadings on this
factor are: AV-Theoretical (33) and KD-Scientific (19). These
loadings support the interpretation given to the factor.

Factor II contributes more than any other factor to the prestige
cross-products. Loadings for the prestige variables on this factor
range from a high for surgeon (108) to a low for curator, art gallery
(−83). The loadings of the prestige variables on this factor reflect
the mean prestige ranking of each occupation, i.e., surgeon had the
highest mean ranking and curator, art gallery the lowest. If the in-
terpretation which has been given to this factor is accepted, the
implication is that the income of a job is the most important de-
terminant of the prestige which is attached to it.

The highest loadings on Factor II among the remaining variables
are: AV-Economic (18), GL-Power (18), and AV-Religious (−21).
These loadings are quite consistent with our interpretation of the
factor.

Factor III was interpreted as preference for persuasive occupa-
tions. This factor ranks second with respect to contribution to the
prestige cross-products. The pattern of the loadings of the prestige
rankings is much the same as for the corresponding preference vari-
ables. Here we find that the persuasive occupations load positively,
whereas the scientific-technical occupations load negatively. As
pointed out above, the reverse is true for Factor I.

Other variables which have relatively large loadings on Factor III
are: AV-Economic (37), AV-Political (21), KD-Persuasive (17),
AV-Theoretical (−25), AV-Aesthetic (−20), and KD-Scientific
(−22). These loadings support the notion that this dimension con-
trasts persuasive with scientific interests.

Factor IV ranks third with respect to contribution to the prestige 
cross-products. Those occupations which involve teaching have posi- 
tive loadings as expected. However, it is interesting to note that the 
prestige loadings seem to be tempered by the amount of education 
necessary for employment in an occupation. For example, the pres- 
tige loadings for clergyman, college dean, lawyer, and surgeon are 
larger than the loading for high school teacher. 

The prestige variables which have large negative loadings on 
Factor IV are union business agent (—29), life insurance salesman 
(—28), and advertising agent (—23). If our reasoning above is cor- 
rect, these loadings resulted primarily because these three occupa- 
tions require less formal education than any others. Loadings of the 
remaining variables reveal that high religious values [AV-Religious 
(29) and GL-Religious (19)] and low economic and power values 
[AV-Economic (−26) and GL-Power (−26)] are characteristic of
this dimension. 

Prestige variables with positive loadings on Factor V include 
artist (36), novelist (34), curator, art gallery (32), historian (19), 
anthropologist (18), and sociologist (17). Therefore, prestige of the 
social sciences as well as the arts is picked up by this factor. There 
seems to be no simple explanation for low prestige rankings on this 
factor. 

The interest and value measures support this view. Although the
positive loadings are consistent and well defined [AV-Aesthetic
(52), KD-Literary (29), KD-Artistic (22), and GL-Power (19)],
the negative loadings suggest a variety of values and interests [AV-
Economic (−27), GL-Religious (−23), KD-Computational (−19),
AV-Religious (−18), and KD-Clerical (−17)]. The ability meas-
ures have their highest loadings on this factor [Verbal (21) and
Quantitative (−12)]. Although not high in an absolute sense, these
loadings are consistent with interest in and preference for artistic
as opposed to technical types of work.

Prestige variables which load highly on Factor VI are those oc- 
cupations which define the factor, i.e., geologist (28), meteorologist 
(24), and anthropologist (24). The loadings of the remaining pres- 
tige variables are generally low and suggest no further interpreta- 
tion. KD-Outdoor (19), AV-Theoretical (13), and KD-Scientific 
(10) load positively as expected. KD-Persuasive (—14) is the only 
interest or value measure with an appreciable negative loading. 

Prestige loadings on Factor VII closely reflect the definition of 
the factor, i.e., architect (32), industrial designer (30), and civil 
engineer (17) are the only variables with interpretable positive load- 
ings. As in the case of preference rankings, historian (—23) and 
novelist (−19) have negative loadings on this factor. Other meas-
ures with positive loadings are GL-Religious (25), KD-Artistic (24),
AV-Religious (20), and AV-Aesthetic (17). Those with negative
loadings include KD-Literary (−19), AV-Political (−19), KD-
Scientific (−17), and SCAT-Verbal (−15). These loadings clearly
show that this factor distinguishes technical-artistic from
literary-artistic interests. Also, it is interesting that re-
ligious values go with technical-artistic interests, whereas such
values are negatively related to artistic interests as measured by
Factor V.

Factor VIII is defined by contrast in preference for the role- 
related vs. the activity-related attributes of a job. The prestige 
loadings on this factor are generally quite low and only suggestive. 
However, it is clear that this factor taps some of the values measured 
by the other instruments. Variables with high loadings are GL-Serv-
ice (20), AV-Social (19), GL-Power (−30), AV-Political (−27),
and AV-Economic (−16). These loadings are, of course, quite con-
sistent with the interpretation of this factor.

Factor IX was tentatively interpreted as a dimension which con-
trasts preference for creativity and security. This factor is particu-
larly interesting since it ranks fourth among the 10 factors with
respect to contribution to the prestige cross-products. Prestige vari-
ables which have positive loadings include: lawyer (29), surgeon
(28), college dean (22), nuclear engineer (20), newspaper editor
(19), and solid state physicist (18). Those with highest negative
loadings are life insurance salesman (−34), hotel manager (−28),
union business agent (−23), curator, art gallery (−22), and high
school teacher (−21). This pattern of loadings is consistent with
the fact that “making important decisions” and “creativity” define
the positive end of the continuum while “working conditions” and
“security” define the negative side. Religious values [GL-Religious
(−26) and AV-Religious (−19)] and clerical interests [KD-Cleri-
cal (−22)] are also related to the “security pole” of this factor.

Loadings of the prestige variables on Factor X are generally low 
but are consistent with the corresponding preference loadings. Value 
and interest measures which are positively related to this factor are 
KD-Social Service (31), AV-Social (26), and AV-Religious (24). 
Those with negative loadings are KD-Outdoor (−29), KD-Mechan-
ical (−24), and AV-Economic (−18). The positive pole of this
factor is, therefore, rather clearly defined as interest in people while 
the negative loadings suggest interest in objects or things. 

There are three general statements which can be made to sum- 
marize the loadings in Table 2. First, the pattern of the loadings 
of the prestige variables is much the same as for the corresponding 
preference variables. The magnitude of these loadings is somewhat 
less, but this is not surprising since the preference variables were 
used in defining the factors. 

Second, the loadings of the remaining interest and value variables 
are generally small, but consistent with the interpretation given to 
the factors. Especially in the case of the Kuder scores, the loading 
may seem smaller than would be expected. This can probably be 
accounted for by the way in which these scores are defined. 

The Kuder questionnaire consists of triads of stimuli which repre-
sent leisure-time activities, hobbies, and vocational and school sub-
ject interests as well as occupational preferences. The scores which
are obtained from such an instrument, therefore, tend to be much
more broadly defined than are the factors which resulted from our
preference questionnaire. Additional “slippage” is introduced by
the differences in sampling of occupations in the two questionnaires.
For example, the first factor in our analysis is defined primarily by
occupations which do not appear in the Kuder due to the recency
of their evolution. We have interpreted the factor as scientific-tech-
nical interest, but it is clearly not the same as the Kuder Scientific
scale. Similar differences exist with other factors since the occupa-
tions which we sampled were restricted to those which would be
appropriate for college graduates. Such a restriction means that
Kuder scales for Outdoor, Mechanical, Computational, and Clerical
occupations were not represented in our sampling.

Third, the ability measures as represented by the SCAT scores 
do not relate to the preference factors. 


Summary 


This study was undertaken to: (a) determine the dimensionality 
of the occupational preferences of college students, and (b) relate 
other measures of preference and ability to the dimensions which 
result from analysis of occupational preferences. 

Occupational preference rankings were obtained from 280 under- 
graduate males. The preference questionnaire consists of 31 oc- 
cupational titles cast into a “balanced incomplete blocks” design as 
described by Gulliksen and Tucker. The subjects ranked occupa-
tional prestige and the desirability of various job attributes and
general goals of life on separate questionnaires, which were con-
structed with variations on the incomplete blocks design. Allport-
Vernon, Kuder Preference, and SCAT scores were also obtained.

A principal axis factor analysis was performed on the cross- 
products among the 46 occupational preference and job attribute 
variables using the technique derived by Tucker for double-centered
matrices. Ten factors were retained following a translation with
respect to the preference variables and a rotation according to the
varimax criterion.

Eight factors were interpreted. Six are defined predominantly
by the occupations and indicate preference for: scientific-technical
occupations (I), persuasive occupations (III), occupations which
involve teaching (IV), artistic occupations (V), outdoor-scientific
occupations (VI), and occupations which require artistic-technical
skills (VII). Factor II is defined both by occupations and attributes
and was interpreted as preference for high-income occupations. Fac-
tor VIII is a bi-polar factor which contrasts preference for the ac-
tivity-related attributes versus the role-related attributes of a job.

These results suggest that occupational preferences are determined 
by the activities which are involved in a job rather than by more 
general attributes which may characterize the occupation. The only 
attributes which appear to relate to job preferences are those which 
contrast material and non-material rewards. 




Otherwise, the dimensions found in the present study are gen- 
erally consistent with those which have been reported by previous 
investigators using more conventional interest measures. The dif- 
ferences which do exist reside in the manner in which the factors 
are defined. This can be attributed to differences in sampling of 
variables. Such differences, however, are important since the domain 
of occupations "available" is constantly changing. This is especially 
true in the case of space and computer technologies which have 
existed for only relatively short periods of time. These changes can- 
not fail to alter the perception of occupational opportunities and 
consequently change the structure of preferences over time. 

The remaining prestige, preference, and ability measures which
were obtained were related to the factors through the factor exten-
sion technique. The pattern of the loadings for these variables is gen-
erally consistent with and gives support to the interpretation of the
factors.


REFERENCES 


Buros, O. K. The Fifth Mental Measurements Yearbook. Highland
Park, N. J.: Gryphon Press, 1959.

Cochran, W. G. and Cox, G. M. Experimental Designs (First Edi-
tion, 1950; Second Edition, 1957). New York: John Wiley &
Sons.

Cooperative Test Division, Educational Testing Service. General
Goals of Life Inventory. Princeton, N. J.: Author, 1950.

Cooperative Test Division, Educational Testing Service. Coopera-
tive School and College Ability Tests. Princeton, N. J.: Author,
[year illegible].

Dwyer, P. S. “The Determination of the Factor Loadings of a Given
Test from the Known Factor Loadings of Other Tests.” Psy-
chometrika, II (1937), 173-178.

[Authors illegible]. “Paired Comparisons from Balanced [...]
Blocks.” [...] Program Library, File Number 6.0038, 1959.

Gulliksen, H. and Tucker, L. R. “A General Procedure for Obtain-
ing Paired Comparisons from Multiple Rank Orders.” Psy-
chometrika, XXVI (1961), 173-183.

Kuder, G. F. Kuder Preference Record—Occupational, Form D.
Chicago: Science Research Associates, 1956.

Tucker, L. R. “Factor Analysis of Double Centered Score Matrices.”
Research Memorandum 56-3, Princeton, N. J.: Educational Test-
ing Service, 1956.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol XXIII, No. 3, 1963 


THE MEASUREMENT OF SCALABILITY FOR 
NON-CUMULATIVE ITEMS* 


JOACHIM F. WOHLWILL 
Clark University 


CONSIDERABLE attention has been given in recent years to the 
quantitative treatment of scalogram data, based on Guttman's 
cumulative model of scalogram analysis (cf. White & Saltz, 1957).
This model postulates the presence of a unidimensional attribute 
underlying both a set of items and subjects’ responses to this set, 
such that a subject will respond to (or pass) all items falling below 
his own position on this assumed dimension, and no items located 
above this point. 

As Torgerson (1958) has pointed out, this model has a logical
counterpart in the non-cumulative, or point model, in which a
similar unidimensional attribute underlying both stimuli and re- 
sponses is postulated, but with the subject responding only to those 
items lying at or close to his own position on the scale. A good 
example of a scale fitting this model is an adaptation of a Thurstone 
attitude scale, in which, from among a set of items presumably 
sampling different points along a single dimension, a subject checks 
those with which he agrees most closely. 
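
The difference between the two models can be made concrete with a small sketch. It assumes Python with NumPy, six items at hypothetical, equally spaced scale positions, and a point-model subject who checks the r items nearest his own position; none of these particulars come from the paper.

```python
import numpy as np

item_positions = np.arange(1, 7)       # hypothetical scale values of six items

def cumulative_response(subject_position):
    """Cumulative model: respond to every item below the subject's position."""
    return (item_positions < subject_position).astype(int)

def point_response(subject_position, r=2):
    """Point model: respond only to the r items closest to the subject."""
    distance = np.abs(item_positions - subject_position)
    chosen = np.argsort(distance, kind="stable")[:r]
    pattern = np.zeros(item_positions.size, dtype=int)
    pattern[chosen] = 1
    return pattern

subject = 3.5                          # hypothetical position on the dimension
print(cumulative_response(subject))    # [1 1 1 0 0 0]
print(point_response(subject))         # [0 0 1 1 0 0]
```

In the cumulative case the number of items passed locates the subject; in the point case every subject checks the same number of items, and only their location in the item order is informative.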

In spite of its direct relevance to the measurement of attitudes,
however, this model has not been subjected to mathematical treat-
ment, with the exception of an unpublished paper by Mosteller
(1949). In particular, the problem of the measurement of scalability
for data conforming to this model remains to be solved. While
Torgerson suggests, plausibly enough, that this problem may be
approached in essentially the same way as has been done for the
cumulative model, an attempt at a more explicit treatment seems
desirable; indeed it will be seen that the extension of procedures
developed for the cumulative model to the point model is by no
means as straightforward as it might appear.


1 The writer is indebted to W. H. Crockett for helpful comments and sug-
gestions regarding this paper, as well as to M. Cohen and B. Beck for per-
mitting the use of their classes for the collection of the illustrative data.


The Problem of Item Order 


Let us assume that a set of n items, believed to fit the non- 
cumulative unidimensional model, has been administered to a group 
of subjects, with instructions to check r of these items, representing 
those they like best, or agree with most closely. If the responses are 
to provide any information about the ordering of the items along 
the assumed dimension, r will have to be greater than one and less 
than n (cf. Torgerson, 1958, p. 312); typically r will be a small 
number such as two, three, or at most four, in order to preserve as
much of the “point” character of the model as possible (i.e., re- 
sponses given to items within a narrow range of the subject’s own 
position on the continuum). The first problem is now to arrange the 
items in the order of their assumed scale values so that a response 
matrix can be set up which will serve as a basis for the subsequent 
determination of the degree of scalability of the responses. 

This problem is rather more complicated here than in the case of 
the cumulative model, where the number of subjects responding to 
each item can serve as a convenient basis for ordering the items 
(e.g., from least popular to most popular, or from most difficult to 
least difficult). In the present model, it is obvious that the relative
"popularity" of the items need not bear any relationship to their 
scale values. An analytic procedure for determining the “best” or- 
dering for any set of items in the non-cumulative case has, how- 
ever, been developed by Mosteller (1949). It consists in assigning 
a set of weights $x_i$ to the items so as to maximize the correlation
ratio of the resulting scores, based on the following rationale: If 
each of N subjects checks r items, the variance of the resulting Nr 
scores can be partitioned into a between-subjects and a within-subjects
component; choosing the weights so as to maximize the
correlation ratio means maximizing the between-subjects variance, 


JOACHIM F. WOHLWILL 545 


relative to the total variance, or, conversely, minimizing the within-subject
variance. Thus the item-weights will be such as to ensure
a maximum of homogeneity in the r responses of each subject.
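The variance partition just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `correlation_ratio`, the 0-based item indices, and the toy patterns and weights are hypothetical and do not come from Mosteller's report.

```python
def correlation_ratio(patterns, freqs, weights):
    """Correlation ratio (eta) of the N*r weighted scores, grouped by subject.

    patterns: tuples of item indices checked (each of length r)
    freqs:    number of subjects giving each pattern
    weights:  one weight per item
    """
    # Every subject contributes r scores, one per item checked.
    n_scores = sum(f * len(p) for p, f in zip(patterns, freqs))
    grand_mean = sum(f * weights[i]
                     for p, f in zip(patterns, freqs) for i in p) / n_scores

    total_ss = 0.0    # all N*r scores about the grand mean
    between_ss = 0.0  # subject means about the grand mean
    for p, f in zip(patterns, freqs):
        scores = [weights[i] for i in p]
        subject_mean = sum(scores) / len(scores)
        total_ss += f * sum((s - grand_mean) ** 2 for s in scores)
        between_ss += f * len(scores) * (subject_mean - grand_mean) ** 2
    return (between_ss / total_ss) ** 0.5


# Toy data (hypothetical): five items, scale patterns ABC, BCD, CDE.
patterns = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
freqs = [10, 10, 10]
ordered = correlation_ratio(patterns, freqs, [-2, -1, 0, 1, 2])
scrambled = correlation_ratio(patterns, freqs, [-2, 1, 0, -1, 2])
```

Weights that respect the item order keep each subject's r scores close together, so `ordered` comes out larger than `scrambled`; maximizing eta over the weights is what drives the items into a consistent order.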

The actual procedure involved will be presented here only in brief 
outline form (cf. Mosteller, 1949, for a more detailed presentation). 
First of all, a preliminary guess is made concerning the most likely 
order of the items. Inspection of the most frequent response patterns
will normally provide the basis for such a guess; indeed, it
will often disclose in itself that the items do not conform to scalability.
Thus for r = 3, if two high-frequency patterns have a pair
of items in common, (e.g., ABC and BCD) scalability requires that 
the third items of these two patterns (A and D in this instance) lie 
on opposite sides of the common pair BC on the scale. This means 
that any other pattern containing this pair will become an error pattern,
as will any pattern containing the two extremes, A and D, as
a pair. To the extent, then, that such a nucleus ABCD, established 
from two high-frequency patterns, leaves a substantial number of 
the remaining patterns as not incorporable into the continuum, one 
would have to regard the data as not scalable. 

Once the items have been arranged in some preliminary order, 
a tentative set of weights $x_i$ is assigned to them, such that
$\sum_i f_i x_i = 0$, $f_i$ representing the number of subjects checking
item $i$. (Thus the average of all weighted scores will approximate zero,
an arbitrary but convenient proviso.) The $f_{ij}$ matrix is then laid out,
showing the number of subjects checking all combinations of pairs $ij$ of items.
(The $f_i$'s appear in the diagonal cells of this matrix.)

A second set of weights, $x_i'$, is now obtained by cross-multiplying
the $f_{ij}$'s by their respective column-weights $x_j$, adding these
cross-products row-wise and dividing the resulting sums by the corresponding
$f_i$'s. This procedure is then repeated, substituting the new set
of weights $x_i'$ for the former set $x_i$ as multipliers, and by dint of
successive iteration a definitive set of weights is arrived at which
not only shows the optimal ordering of the items, but in addition 
gives some indication of the metric distances between the items. 
It is important to note, however, that this metric scale for the items 
has meaning only in terms of the criterion adopted for the determina- 
tion of the weights, i.e., the correlation-ratio maximizing criterion. 
The absolute values of these weights, furthermore, have no significance
whatever, since the choice of unit and origin is completely arbitrary.
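The iteration outlined above might be sketched as follows. This is a minimal illustration, not Mosteller's own computing routine: the function name `mosteller_weights`, the toy data, and the re-centring and rescaling applied at every pass are assumptions added here to keep the arithmetic numerically stable (the text only requires the starting weights to satisfy the zero-sum condition). Every item is assumed to be checked by at least one subject.

```python
def mosteller_weights(patterns, freqs, n_items, start, iterations=50):
    """Iterate x_i' = sum_j f_ij * x_j / f_i, starting from a guessed order."""
    # f[i]: subjects checking item i; joint[i][j]: subjects checking both
    # i and j (the diagonal joint[i][i] equals f[i]).
    f = [0] * n_items
    joint = [[0] * n_items for _ in range(n_items)]
    for p, count in zip(patterns, freqs):
        for i in p:
            f[i] += count
            for j in p:
                joint[i][j] += count

    def centre(x):
        # Enforce sum_i f_i * x_i = 0 (assumption: repeated each pass so that
        # rounding error cannot drift toward the trivial constant solution).
        mean = sum(fi * xi for fi, xi in zip(f, x)) / sum(f)
        return [xi - mean for xi in x]

    x = centre(start)
    for _ in range(iterations):
        x = [sum(joint[i][j] * x[j] for j in range(n_items)) / f[i]
             for i in range(n_items)]
        x = centre(x)
        scale = max(abs(v) for v in x)  # unit is arbitrary; fix max |x| = 1
        x = [v / scale for v in x]
    return x


# Toy data (hypothetical): scale patterns ABC, BCD, CDE, equally frequent.
toy_patterns = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
toy_weights = mosteller_weights(toy_patterns, [1, 1, 1], 5, [0, 1, 3, 4, 6])
toy_order = sorted(range(5), key=lambda i: toy_weights[i])
```

For these toy data the weights settle on equally spaced values, and sorting the items by weight gives the item order used to lay out the response matrix.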


Determination of Scalability 


Let us suppose, then, that the items have been ordered and, based 
on this order, a matrix of all response patterns has been laid out. 
In the ideal case of perfect scalability, this matrix should be of the 
form shown in Table 1, but in practice a certain number of error 
patterns, i.e., patterns where not all the pluses are contiguous, will 
be exhibited, so that it becomes necessary to devise a measure of 
the scalability of the response matrix. 


TABLE 1
Ideal Response Matrix for Point Model Items, for n = 8, r = 3

Subject                 Items
 Types     1   2   3   4   5   6   7   8
   1       +   +   +   -   -   -   -   -
   2       -   +   +   +   -   -   -   -
   3       -   -   +   +   +   -   -   -
   4       -   -   -   +   +   +   -   -
   5       -   -   -   -   +   +   +   -
   6       -   -   -   -   -   +   +   +


Bearing in mind that according to the model an error pattern 
can be defined as any pattern in which the r items checked are not 
contiguous, a convenient index of reproducibility is 


$$\mathrm{Rep} = 1 - \frac{\sum_i f_i e_i}{N(n - r)},$$ where

$f_i$ are the frequencies associated with each of the $\binom{n}{r}$ possible
response patterns (in practice most of these $f_i$'s would be zero);

$e_i$ is the number of errors contained in each pattern (in practice
the most frequently occurring patterns should be the scale-type
patterns, for which $e_i = 0$);

N is the number of respondents; 

n is the number of items, and 


r is the number of items checked by each respondent. 


It is readily seen that the above formula defines the reproducibility
of a matrix as one minus the ratio of the number of minuses in
the matrix that fall between pluses (i.e., errors) to the
total number of minuses.
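As a sketch of how this index could be computed, the following Python counts, for each pattern, the minuses lying between its first and last plus. The function names and the small data set are hypothetical; item positions are 0-based and assumed to be already in scale order.

```python
def count_errors(pattern):
    """Minuses lying between the first and last plus of one response pattern."""
    return (max(pattern) - min(pattern) + 1) - len(pattern)


def reproducibility(patterns, freqs, n_items):
    """Rep = 1 - sum(f_i * e_i) / (N * (n - r))."""
    r = len(patterns[0])          # items checked per respondent
    n_subjects = sum(freqs)       # N
    errors = sum(f * count_errors(p) for p, f in zip(patterns, freqs))
    return 1 - errors / (n_subjects * (n_items - r))


# The six scale types of an ideal matrix for n = 8, r = 3 give Rep = 1.
ideal = [(k, k + 1, k + 2) for k in range(6)]
rep_ideal = reproducibility(ideal, [10] * 6, 8)

# Hypothetical mixed case: nine subjects on a scale pattern, one with a gap.
rep_mixed = reproducibility([(0, 1, 2), (0, 1, 3)], [9, 1], 8)
```

In the mixed case one minus falls between pluses, out of 10 × 5 = 50 minuses in all, so the index is 1 − 1/50.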

Like indices of reproducibility developed for the cumulative 
model, the present index is strongly affected by the marginal totals, 




i.e., the relative popularities of the items. Specifically, the particular 
distribution of these marginal totals will influence the reproducibil- 
ity to be expected by chance, on the assumption of independence in 
the responses to the individual items. While methods have been 
developed for the cumulative model to determine this chance repro- 
ducibility, so as to permit evaluation of the observed Rep in terms 
of it (cf. Green, 1956), this problem is considerably more compli- 
cated in the present case. The reason is that, due to the restriction 
that every response pattern contains precisely r pluses, the mathe- 
matics of independent probabilities are not applicable to this model, 
so that it becomes very difficult to determine what the chance
reproducibility for any particular response matrix should be.²

A very different approach to the assessment of the scalability of 
the responses suggests itself, however, in the form of the correlation
ratio eta. It will be recalled that Mosteller’s scaling procedure was 
based on the criterion that the scale values should be such as to 
maximize the value of eta for the response matrix. For a given set 
of data eta itself provides an eminently useful and valid measure
of the consistency or homogeneity present in the response matrix, 
on the assumption of a unidimensional trait underlying the scale. 
Let us attempt, therefore, to examine its advantages and limitations. 

To start with the latter, it must be recognized that eta, representing
a variance ratio based on the scale values determined for the
items, is clearly not directly comparable to indices of reproducibil- 
ity, consistency or homogeneity which are based on analyses of 
response patterns taking into account only the ordinal relationships 
among the items. Indeed, it is probably best to regard it as complementing,
rather than supplanting, the traditional index of reproducibility.

A second and perhaps more serious apparent drawback of eta is
that its maximal value can never attain unity, falling short by an
amount which decreases as the number of items increases, and
which further varies according to the distribution of item popularities.
These two factors will of course similarly affect the actual
value of eta obtained for any set of data.

² In the special case in which the item-popularities are equal for all items
(i.e., every item is chosen an equal number of times), it would be possible
to calculate chance reproducibility, by applying the principles of probability
for sampling-without-replacement in order to determine the expected frequencies
of error responses. This case is, however, of little practical interest,
since the extreme items on either side of the scale will necessarily have fewer
choices than those in the middle (e.g., A is included only in pattern ABC,
while B is included in this pattern plus BCD, and C in both of the preceding
plus CDE).

It can be argued, however, that according to the logic of the point 
model itself, this property of eta should rather be chalked up in 
favor of this index. The very fact that an individual checks several 
items spread out over a range of the stimulus continuum inevitably 
introduces some departure from strictly point-type responses, so
that it seems reasonable to employ a measure of scalability that
reflects that fact. If it could be shown, furthermore, that the theo- 
retical maximum for eta increases with an increase in the density 
of the items, as distributed along the underlying continuum, the 
validity of this index would be vindicated. 

The density factor can of course be varied in either of two ways: 
by varying the number of items, keeping the segment of the scale 
covered constant, or by varying their spread along the scale while 
keeping constant their number. In the first case, increasing the 
number of items would result in an increase in the theoretical maxi- 
mum for eta, since subjects’ responses could be compressed within 
a narrower range of scale values, relative to the total range, thereby
reducing the within-subjects variance.

The effects of varying the true spread of the items are not as
clear-cut, particularly since they would have to be considered rela- 
tive to the spread of subjects’ own assumed positions on the scale. 
Thus, as long as the range of the scale positions of the items matched 
the range of the positions of the subjects on the scale, eta would be 
expected to remain constant, regardless of the absolute spread of the 
item scale values on the underlying continuum. What this means, 
in effect, is that scalability has to be considered relative to the 
spread of the assumed scale values of subjects represented in the 
sample, which should, in turn, dictate the range of the scale sam- 
pled by the items. It might be noted that this point applies equally 
to the cumulative scalogram model. 

It is interesting, nevertheless, to examine the way in which eta 
would be affected by failure of the range of the scale values of the
items to match that of the subjects. The inclusion of items with
values which are substantially more extreme than the scale positions
of any of the subjects in the sample would exert little influence,
since such items would simply not be responded to, and thus in




effect remain outside of the response matrix. (Indeed, it would gen- 
erally prove necessary to discard such items from the analysis.) On 
the other hand, the effect of compressing the items onto a narrower
range of the continuum than that represented by the positions of 
the subjects would be to increase the frequency of the responses to 
the extreme items. It can be shown mathematically that the maximum
value of eta attainable under such conditions is larger than
under conditions of even distribution of subjects over the segment 
of the continuum covered by the items. For instance, given a set 
of five items, A, B, C, D and E, the value of etamax (i.e., assuming 
perfect scalability), for the case in which the three response pat- 
terns ABC, BCD, CDE occur with equal frequencies, is .71. It 
increases to .76, for a case where these three patterns occur with 
relative frequencies of 3:1:3, respectively. An even higher value 
of eta, .80, is obtained for a distribution of these proportional fre- 
quencies in a 1:2:3 ratio. Thus, in general, it would seem to be 
important to pay some attention to the form of the distribution of 
response patterns, and if possible to select items so as to avoid a
piling up of responses at either or both ends of the scale. 
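The dependence of the maximum eta on the pattern frequencies can be checked numerically. The sketch below is an illustration under assumptions, not the computation used in the paper: it finds the maximizing weights by reciprocal averaging (subject scores are averaged item weights, item weights are averaged subject scores), which differs from the iteration on the joint-frequency matrix only by a constant factor. The function name `eta_max` is hypothetical, and every item is assumed to be checked at least once.

```python
def eta_max(freqs, iterations=200):
    """Largest attainable eta when only the scale patterns ABC, BCD, CDE occur.

    freqs: frequencies of the patterns ABC, BCD, CDE, in that order.
    """
    patterns = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
    n_items, r = 5, 3
    f = [sum(c for p, c in zip(patterns, freqs) if i in p)
         for i in range(n_items)]
    n_scores = r * sum(freqs)

    x = [float(i) for i in range(n_items)]  # start from the known item order
    for _ in range(iterations):
        scores = [sum(x[i] for i in p) / r for p in patterns]
        x = [sum(c * s for p, c, s in zip(patterns, freqs, scores) if i in p) / f[i]
             for i in range(n_items)]
        mean = sum(fi * xi for fi, xi in zip(f, x)) / n_scores
        x = [xi - mean for xi in x]          # keep the weighted mean at zero
        scale = max(abs(v) for v in x)
        x = [v / scale for v in x]           # unit is arbitrary

    # Correlation ratio: between-subjects over total sum of squares.
    scores = [sum(x[i] for i in p) / r for p in patterns]
    between = sum(c * r * s ** 2 for c, s in zip(freqs, scores))
    total = sum(fi * xi ** 2 for fi, xi in zip(f, x))
    return (between / total) ** 0.5


eta_equal = eta_max([1, 1, 1])
eta_ends = eta_max([3, 1, 3])
```

Rounded to two decimals, `eta_equal` and `eta_ends` reproduce the values quoted above for the equal-frequency case and the 3:1:3 case.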

It is perhaps advisable here to point out again that the calculated 
scale values, i.e., the item weights obtained by Mosteller’s method, 
do not give any information concerning the true spread of the 
items along the scale, since they are devoid of any absolute meaning,
both the size of the unit and the location of the origin being
altogether arbitrarily determined. 


Illustrative Example 


Let us illustrate the procedure outlined above by reference to a
concrete example. For this purpose, a subset of items from one of
the Thurstone attitude scales was taken, for which the known scale
values could serve as an external check on the validity of the scale
values obtained by Mosteller's method.

Nine items were selected from Peterson's (1931) Attitude Towards
Capital Punishment Scale, Form B, and administered to a
group of 115 undergraduate students with instructions to check 
the three items with which they agreed most closely. The list of 
items used, with their respective scale values as determined by 
Peterson by Thurstone's method, are given in Table 2. Table 3 
shows the response matrix obtained. Since item G was checked




TABLE 2
List of Items Used for Illustrative Data¹

A  (5.4) I don't believe in capital punishment but I'm not sure it isn't necessary.
B  (2.8) Capital punishment is not an effective deterrent to crime.
C  (9.4) Capital punishment gives the criminal what he deserves.
D  (0.6) Capital punishment is one of the most hideous practices of our time.
E  (8.4) Capital punishment is justified because it does act as a deterrent to
         crime.
F  (6.2) I think capital punishment is necessary but I wish it were not.
G  (8.5) The death of a comrade in prison embitters all the inmates against
         the state.
H  (7.2) Until we find a more civilized way to prevent crime we must have
         capital punishment.
I  (2.0) The state cannot teach the sacredness of human life by destroying it.

¹ Reproduced by permission of University of Chicago Press from Peterson's (1931) Scale of
Attitude towards Capital Punishment, Form B. Original scale values are given in parentheses.


by only three subjects, it was decided to exclude it from this analy- 
sis (a procedure which, however questionable, may be defended 
in this instance in view of the purely illustrative purpose of these 
data). Table 3 also shows the scale values ($x_i$) obtained by Mosteller's
procedure, after four iterations. In order to render them
comparable to those given by Peterson (shown in the $y_i$ row in
Table 3) they were subjected to a linear transformation by means
of the equation $x_i' = (-8.8/7.2)\,x_i + 4.3$; this had the effect of
forcing the Mosteller scale values onto the same scale as Peterson's
(thus the values for the extreme items in rows $x'$ and $y$ are identical).
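The conversion is simple enough to verify directly. In the sketch below the function name `to_peterson_scale` is hypothetical, and only a few Mosteller weights that are clearly legible in Table 3 are used as inputs.

```python
def to_peterson_scale(x):
    """Linear transformation x' = (-8.8 / 7.2) * x + 4.3 quoted in the text.

    The slope is negative because the Mosteller weights run in the opposite
    direction to Peterson's scale values; 8.8 is the range of Peterson's
    values (9.4 - 0.6) and 7.2 the range of the Mosteller weights.
    """
    return (-8.8 / 7.2) * x + 4.3


# A few Mosteller weights read from Table 3 and their converted values,
# rounded to one decimal as in the table.
converted = {x: round(to_peterson_scale(x), 1) for x in (-3.4, 2.0, 3.0)}
```

The weight +3.0 at one extreme maps onto 0.6, Peterson's value for the item at that end of the scale.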

A comparison of the two sets of scale values shows reasonably
close agreement, although one reversal does appear: item F scales
higher than item H, contrary to the results from Peterson’s judges.
(It appears that our subjects responded more to the tone of the 
first part than that of the second part of these expressly ambivalent 
sentences. A similar tendency on item A might also account for the 
substantial discrepancy between the values obtained from the two 
procedures for that item.) 

Finally, applying the measures of scalability proposed above to 
this set of data, we find Rep = 1 − 31/(5 × 112) = .945, indicating,
as the response matrix itself suggests, that the data approximate 
closely to scalability. Calculating eta, on the basis of the Mosteller 
scale values obtained, we obtain a value of .878. While it hardly 
seems very fruitful to insist on a sharp cut-off point for determining 
the presence or absence of scalability, as measured in terms of eta, 
a plausible and convenient rule of thumb might be to set a lower




bound of .70 (i.e., within-subjects variance = 50 per cent or less
of the total) as the minimum acceptable. Thus by this index, too, 
the present data show satisfactory homogeneity. 
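The arithmetic behind these figures can be written out as a check. The value N = 112 used here is an inference rather than a figure stated in the text: it is presumably the 115 respondents less the three who checked the excluded item G, it agrees with the legible column totals of Table 3, and it is the value that reproduces the reported .945.

```latex
\begin{align*}
\mathrm{Rep} &= 1 - \frac{\sum_i f_i e_i}{N(n-r)}
              = 1 - \frac{31}{112\,(8-3)}
              = 1 - \frac{31}{560} \approx .945,\\
\eta^2 &= (.878)^2 \approx .771
        \quad\text{(within-subjects share} \approx .229\text{)},\\
\eta = .70 &\;\Longrightarrow\; \eta^2 = .49
        \quad\text{(within-subjects share} = .51\text{)}.
\end{align*}
```

The last line shows why a lower bound of .70 for eta corresponds to roughly half of the total variance lying within subjects.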

Results for r = 2. Since subjects to whom the questionnaire had 
been administered were asked, after checking their three preferred 
items, also to rank-order them, it was possible to reanalyze the 
data, taking account only of a subject’s first two choices, i.e., with 
r = 2 instead of 3. The scale values resulting from an application
of Mosteller’s method to the new response matrix are in general 
closely comparable to those found for r = 3; more particularly, 


TABLE 3 


Response Matrix and Item Scale Values for Illustrative
Data on Attitude towards Capital Punishment


Response Patton А 


Рет Boom 4 (0. БЕН ЈА Bo DITE 
a* —10.1 $0 XX Xo. See 
b —9.4 з Xx — X — 1 
c -8.5 1o xoa Leste Е НЫ 
a" стт 5.9281 — Ж aie Yh 0 
е —4.6 l.-— 0X os Е, а 1 
A сат O E = О ЕУ Уан 0 
g —818 1 oXov Жинел = МИНДЕ 
h —92.4 $ — — XX НК MEN 
i 21 Be eg EN эр =... 
j +0.1 Р у= ЛЕУ ОНАН, 
k +0.3 1 —. — X EE E 
r --0.8 suu рар 97 = 0 
m 41.0 ga) od! oe OK eee EE 1 
n +2.4 $ иа ee САРТ T, 1 
o +3.4 l1 SAS ee a E 2 
р* 44.8 co E S dris ч 
q 45.6 کا کے کے و‎ кү + X — Eid 
d 45.9 t URL EET A 
ы +7.2 96: C. SSS ES eir Р MO 
N 

Total-- 12  , gi 46 60 45 02 5 90 

Total — 105 si 66 52 67 50 57 8 

Item Scale Values
x_i   Mosteller               -4.2  -3.4  -2.5  -1.8  +0.6  +2.0  +2.2  +3.0
x_i'  Mosteller (converted)    9.4   8.5   7.4   6.6   3.6   1.9   1.6   0.6
y_i   Peterson                 9.4   8.4   6.2   7.2   5.4   2.8   2.0   0.6

* Perfect-scale patterns
† Determined from x-values (Mosteller item-weights)




the ordering of the items remained unchanged. Item F was, however,
brought considerably closer to item H in this analysis, an
interesting finding, since these two items involved the one reversal 
in the ordering of the items by Mosteller's method in comparison 
to that determined by Peterson. The value of Rep obtained from 
this second analysis is virtually identical with that found earlier: 
.943 vs. .945. Not surprisingly, however, eliminating the third item
in every response pattern results in a substantial rise in the value
of eta, which is now .95, as against .88 for r = 3. This finding indicates
that eta should, in general, be evaluated relative to the size
of r and that for many purposes a more valid measure of point-scalability
may be obtained by restricting r to 2; in any case it
would rarely seem warranted to increase it beyond 3. 


Conclusions 


In his treatment of scalogram analysis, Torgerson (1958) lays 
considerable stress on the determinate as opposed to probabilistic 
character of scalogram approaches such as Guttman’s, and on the 
failure of such models to give any place to error. It might seem 
that, by determining a set of numerical scale values for the items 
so as to permit the application of variance statistics to the data,
Mosteller’s technique has managed to circumvent this shortcoming 
of the standard cumulative scalogram techniques. It is important 
to note, however, that, despite the intimate relationship of eta to 
the analysis of variance model, ordinary tests of significance are 
not applicable to this measure as employed here, since the way in 
which the item scale values—and accordingly eta itself—are deter- 
mined, from the response patterns themselves, clearly rules out the 
application of ordinary sampling statistics. In practice this limita- 
tion is perhaps not too serious, since in studies of scalability the 
interest is generally in the extent of approach to perfect scalability, 
rather than in the departure of the response patterns from chance. 
Nevertheless this feature would seem to render more difficult the 
further mathematical elaboration of this model, and its eventual 
integration with other statistical models. 

On the other hand, it is always possible, and indeed desirable, to 
apply the score values calculated for a set of items to a new sample 
of respondents, so as to obtain a check on the value of eta. The scale 
values would now have the status of a priori, independently determined
values, rendering them amenable to the application of sampling
statistics. The same would be true wherever some prior theoretical
or empirical grounds exist for assigning scale values to the
items, so that eta may be determined in terms of these values, and 
Rep calculated on the basis of the resulting item order. An obvious 
case in point is our illustrative example, where the scale values
obtained by Peterson could equally well have been used to deter- 
mine scalability. The same would apply to responses to stimuli 
scalable on some independent objective criterion, e.g., complexity
measured in terms of informational content. 

The question of the place of Coombs’ unfolding technique 
(Coombs, 1950) in the present context may be raised. As Torgerson 
(1958) has shown, a modification of Coombs’ general procedure, 
in which a subject is asked only to rank a subset of r items, rather
than the entire set of n items, is closely similar to the point-scaling 
model outlined here; indeed, whenever such data are collected, they 
can readily be analyzed by Mosteller’s method, simply by ignoring 
the item rank orders. While this may seem to involve throwing
away potentially valuable information, there are no analytic tech- 
niques presently available, to this writer's knowledge, for assessing 
the extent of the scalability of rank-order data conforming to 
Coombs’ essentially deterministic model. This situation may reflect 
the inclination of Coombs and his followers to turn to multi-dimen- 
sional models of scaling in order to handle cases which fail to yield 
a perfect fit to the unidimensional model. 

A rather different model of scaling which is likewise closely related
to Mosteller's technique is Thurstone's (1929) model of similar
reactions, or similar attributes. Like Mosteller's, it starts from
an analysis of the $f_{ij}$ matrix. These frequencies are then translated
into absolute inter-item separation values, under the assumption 
that the probability of two items being jointly endorsed is a Gaus- 
sian function of their linear separation. As Torgerson (1958) has 
noted in his discussion of this model, its essentially probabilistic 
character allies it more closely to the latent-structure type of scal- 
ing model than to the scalogram type. This very circumstance would 
probably render the application of reproducibility measures to data
scaled according to this model somewhat inappropriate. Furthermore
the model as such applies explicitly only to the measurement
of similarity among pairs of items in n-dimensional space, without 




any assumption of a unidimensional continuum underlying the 
items being made explicit, as Torgerson points out. As a result, no 
rational basis for determining the ordering of the items along such 
a continuum is provided (indeed in his illustrative example Thur- 
stone establishes this order on the basis of another, altogether ex- 
traneous set of scale values). 

In conclusion, let us examine briefly the applicability of the point 
model to psychological problems. It would seem that in any pre- 
ferential-choice situation, whether involving attitudes, interest, 
aesthetic judgments, food preferences, etc., the model is potentially 
applicable, if there are reasons to suspect that the stimuli from 
which the choices are made can be ordered along a single con- 
tinuum. The appropriateness of the model to the Thurstone atti- 
tude scale in particular needs no further emphasis; it seems that for 
such scales a point model will in general provide a much better fit 
than a cumulative one.³

One further realm of application to which the writer would like 
to call attention is the area of developmental psychology. While 
much has been written, both pro and con, concerning the question 
of the orderliness and sequential patterning of developmental 
changes, empirical research ostensibly aimed at this question has for 
the most part relied quite inappropriately on cross-sectional comparisons
of means, frequencies, etc. Scalogram analysis, both of the
cumulative and the point type, appears to be ideally suited to the 
study of developmental sequences. Recognizing this, a number of 
workers have indeed begun recently to apply the cumulative model 
to the development of cognitive skills and abilities. The point
model, on the other hand, may be expected to prove of similar value 
in the investigation of developmental sequences in such areas as
interests, preferences, aesthetic judgments and the like, as well as 
in situations where an individual at a given developmental level 
displays one and only one mode of behavior, as in the observational 
study of motor development in infancy. A more extended and systematic
treatment of the application of scalogram techniques in
developmental psychology is currently in preparation.

³ That this need not be true for every Thurstone scale is suggested by the
results obtained from an analysis of a 9-item scale drawn from
Forms A and B of a recently published scale of attitudes towards physical
education (Richardson, 1960). The response patterns obtained (from 115 subjects)
clearly showed that the items deviated very considerably from scalability,
so that the application of the procedures outlined in this paper was
not even attempted.


REFERENCES 


Coombs, C. H. “Psychological Scaling Without a Unit of Measurement.”
Psychological Review, LVII (1950), 145-158.

Green, B. F. “A Method of Scalogram Analysis Using Summary
Statistics.” Psychometrika, XXI (1956), 79-88.

Mosteller, F. “A Theory of Scalogram Analysis Using Non-cumulative
Types of Items.” Report No. 9. Laboratory of Social Relations,
Harvard University, 1949.

Peterson, R. C. A Scale for Measuring Attitude towards Capital
Punishment. Chicago: University of Chicago Press, 1931.

Richardson, C. E. “Thurstone Scale for Measuring Attitudes of College
Students toward Physical Fitness and Exercise.” Research
Quarterly of the American Association for Health and Physical
Education, XXXI (1960), 638-643.

Thurstone, L. L. “Theory of Attitude Measurement.” Psychological
Review, XXXVI (1929), 222-241.

Torgerson, W. S. Theory and Methods of Scaling. New York: John
Wiley & Sons, 1958.

White, B. W. and Saltz, E. “Measurement of Reproducibility.”
Psychological Bulletin, LIV (1957), 81-99.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 3, 1963


THE COMPARATIVE VALIDITY OF THE WAIS AND THE 
STANFORD-BINET WITH COLLEGE FRESHMEN 


A. STEVEN GIANNELL 


State University College at Potsdam, N. Y. 
AND 
CECIL M. FREEBURNE 


Bowling Green State University 


Problem 


On the basis of a study which he conducted before releasing the
Wechsler Adult Intelligence Scale (WAIS), Wechsler (1955) raised
a question as to the possible lower validity of the Stanford-Binet
(S-B) as compared with the Wechsler Adult Intelligence Scale.
Using 52 white, male, adult subjects from the Annadale Reformatory,
he found a WAIS mean IQ of 95.4 and an S-B mean of 100.4.
The standard deviations were 11.7 and 17.3, respectively. Wechsler 
indicated that the subjects of the Annadale sample were inferior to 
the general population in education (52 per cent had less than eight 
years of schooling and no one more than twelve, whereas in the general
population the proportion of people with eight years or less of
education was between 25 per cent and 32 per cent). He reasoned 
that, since education is highly correlated with IQ obtained on the
WAIS and on the S-B, the mean IQ of this reformatory sample 
should be lower than that of the general population, and the stand- 
ard deviation more restricted. He concluded that the WAIS data 
conformed to such expectations, whereas the S-B data did not. 

The present investigators checked the difference between the 
WAIS and S-B mean IQ's yielded by the Annadale sample and
found it to be significant beyond the one per cent level. The differ- 
ence in years of education between the Annadale sample and the 
WAIS standardization sample reported in the WAIS Manual was 
also significant beyond the one per cent level, except for the group




who had twelve years of education. The percentages of the general 
population having given numbers of years of education are, as 
shown in the WAIS Manual, very close to those of the WAIS 
standardization sample; thus Wechsler's inference seemed to be 
justified that a significant difference between the WAIS standardization
sample and the Annadale sample could be taken as an
indication of a significant difference between the Annadale sample 
and the general population also, the samples being randomly se- 
lected. In addition, several studies report significant correlation 
between years of education and IQ’s obtained on intelligence tests 
of the Wechsler and S-B type (Burt, 1927; Wellman, 1940; Garrett, 
1946; Altus, 1949; Lorge, 1945; Wellman, 1945). The evidence on 
these two points seemed to provide a basis for Wechsler’s question 
as to the possible higher validity of the WAIS as compared with 
the S-B. 

However, studies involving a comparison between the Wechsler-Bellevue
Scale, from which the WAIS was derived, and the S-B
suggested a different explanation of the discrepancies between IQ's 
yielded by the two scales. These studies found consistently that 
about or above the mean IQ of 100 the S-B yielded significantly 
higher scores than the Wechsler-Bellevue (Anderson, 1942; Mitchell, 
1942; Sartain, 1946). The discrepancy was explained on the basis 
of the different standard deviations and of a lack of comparability 
between the samples used in the standardization of the two tests 
(Anastasi, 1955). An indication of what may be expected when 
the WAIS and the S-B are compared comes from a study conducted 
by Goolishian (1956). He found that the WAIS yielded lower scores 
when compared with the Wechsler-Bellevue. Thus, even higher 
discrepancies may be expected between the WAIS and the S-B than
between the Wechsler-Bellevue and the S-B.

The present writers were interested primarily in the relative 
validities of the WAIS and the S-B at the college freshman level. 
If an upper, a middle, and a lower portion of a distribution of 
freshmen were obtained on the basis of their scores on the American
Council on Education Psychological Examination for College
Freshmen (ACE) and first semester grades, would the S-B fail
to discriminate among the upper, middle, and lower groups as it
seemed not to discriminate between the Annadale sample and the
general population, or would the S-B simply yield higher scores




than the WAIS at the three levels while showing about the same
power to discriminate among the three levels? 

Specifically, the following four working hypotheses were formu- 
lated: 


(1) The WAIS means of the upper, the middle, and the lower 
groups differ significantly from each other, while the S-B means 
of the three groups do not differ significantly from each other; i.e., 
the WAIS is valid for college freshmen while the S-B is not, within 
the limits of the present investigation. 

(2) The correlation coefficients between the WAIS and the
chosen criteria of selection (ACE and grades) are significantly
higher than the correlation coefficients between the S-B and the
criteria of selection, when the three groups are combined; i.e., the 
WAIS is significantly more valid for college freshmen than the 
S-B in terms of over-all correlation with the selection criteria. 

(3) The correlation coefficients between the WAIS and the 
criteria of selection are significantly higher than the correlation 
coefficients between the S-B and the criteria of selection, when each
of the three groups is taken separately; i.e., the WAIS is signifi-
cantly more valid for college freshmen than the S-B in terms of
the correlations at the three levels. 

(4) For each of the lower, the middle, and the upper groups 
the IQ means on the S-B are significantly higher than the IQ 
means on the WAIS. 

Besides testing these four hypotheses, the present study was 
intended to check possible differences in the performance of males 
and females on both WAIS and S-B, and possible differences be-
tween the Verbal and the Performance Scale of the WAIS.


Method 


The subjects were selected from a population of 1400 freshmen 
who entered Bowling Green State University in the Fall Semester, 
1956. The sample was obtained at the beginning of the Spring 
Semester in the following manner. The subjects’ ACE scores were 
used to obtain the upper 10%, the middle 10% and the lower 
10% of the population, in equal number of males and females at 
each level. The upper 10% was obtained by taking the freshmen 
above the 87th percentile, the middle 10% by taking those between
the 47th and the 52nd percentile, and the lower 10% by taking
those below the 12th percentile.

From these three strata of 70 males and 70 females each, a further
selection was made according to age and first semester grade-point
average. This selection was accomplished by taking the freshmen
between the ages of 18 and 21 who had a grade-point average
(on a 4.0 basis) above 2.80 in the upper 10% of ACE scores, between
1.70 and 2.79 in the middle 10% of ACE scores, and below
1.69 in the lowest 10% of ACE scores.

As a result of this selection, a total of 74 subjects remained
in the upper stratum (27 males and 47 females), a total of
76 (35 males and 41 females) in the middle stratum, and a total of
84 (42 males and 42 females) in the lower stratum. From each of
these strata a random sample of 20 males and 20 females was
selected with the aid of a table of random numbers.

Other variables, such as geographical origin, urban-rural resi- 
dence, or race and socio-economic status might also determine per- 
formance on intelligence tests. It was thought that random selection 
would provide an adequate statistical control for these variables. 
The procedure for selection of the subjects was carried out in such 
a way that the test administrator did not know to what group the 
subjects belonged, either during testing or during scoring of the test. 

The total number of subjects to be tested was 120 (40 at each
level), but the actual number of subjects who could be tested was
109. Seven subjects dropped out of the university, two were sopho-
mores and had been included in the freshmen population by mis-
take, one subject had already taken the WAIS, and one refused to
take the tests. For the subjects tested the mean age (at the time
of the testing) and the mean of the ACE scores and of the first
semester grades are reported in Table 1. The differences between
the means of the upper and the middle, of the upper and the lower,
and of the middle and the lower groups on both ACE and grades
were tested by means of the t test. All differences were found to be
significant beyond the one per cent level.

TABLE 1

Comparison of Age, ACE Scores and Grades for
Three Groups as Established by ACE Scores*

Groups     N      Age      ACE     Grades

Upper     38    18.24    63.65     3.31
Middle    36    18.33    50.05     2.22
Lower     35    18.69     6.68     1T

* The differences between the means of the upper and middle, the upper and the lower, and
the middle and the lower groups, are all significant beyond the one per cent level of confidence.

All the subjects were tested by the same examiner between
February 22 and June 3, 1957, in the order in which they made
themselves available. The testing was done in the Psychology
Clinic at Bowling Green State University. Both WAIS and S-B,
Form L, were administered in the same session, which lasted an
average of two hours. Half of the subjects were administered the
WAIS first and the S-B second, while the sequence was reversed 
for the other half. The two tests were presented to the subjects as 
if they were a single unit. The S-B was started at year fourteen for 
all subjects. The Vocabulary subtests of the WAIS and of the S-B 
were administered following each other with the thought that this 
would help in giving the impression that a single test was being 
administered instead of two. The scoring of all the tests was done 
by the examiner. A professor of clinical psychology at Bowling 
Green State University did a reliability check on every nth WAIS 
protocol (12th, starting from 100 and counting back) in order to 
discount possible subjectivity involved in the scoring of Compre- 
hension and Similarities. Independent scoring by these two men 
indicated very high agreement (an r of .923 for Comprehension and 
an r of .997 for Similarities). An r of 1.00 was found for all year 
levels of the S-B when the examiner's scoring was checked by 
another professor of clinical psychology at Bowling Green State 
University, who scored every fifth Stanford-Binet protocol. 


Results 


Test of the first hypothesis. The t test was applied to test the
significance of the differences between the means of the upper and
the middle, the upper and the lower, and the middle and the lower
groups, for the WAIS and the S-B separately. For this analysis
raw scores were used, i.e., scaled scores for the WAIS and sum
of months for the S-B. Significance beyond the one per cent level
was found for both WAIS and S-B means of the three groups. The
F test showed the variances of the groups to be homogeneous. The 
data are shown in Table 2. 
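
The kind of computation described above can be sketched from summary statistics alone. The following is a minimal illustration, assuming a pooled-variance t test for independent groups and a variance ratio formed as larger over smaller variance; the source does not state which formulas were used, and the function names are hypothetical. The inputs are the WAIS raw-score means and standard deviations of the upper and middle groups as given in Table 2, with the group sizes from Table 1.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t for two independent groups, using a pooled variance.

    Assumption: this is the textbook pooled form; the article only says
    that "the t test" was applied.
    """
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error


def variance_ratio(sd1, sd2):
    """F ratio for homogeneity of variance: larger variance over smaller."""
    v1, v2 = sd1 ** 2, sd2 ** 2
    return max(v1, v2) / min(v1, v2)


# WAIS raw scores, upper (N = 38) versus middle (N = 36) group.
t_upper_middle = pooled_t(145.50, 8.69, 38, 124.39, 8.37, 36)
f_upper_middle = variance_ratio(8.69, 8.37)
print(round(t_upper_middle, 2), round(f_upper_middle, 2))
```

Under these assumptions the t value comes out close to, though not identical with, the tabled value for the same comparison, which is what one would expect if the original computation used a slightly different variance estimate.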


TABLE 2

Comparison of WAIS and Binet Scores for Three Groups
Established by Scores on ACE

WAIS                t       P      F*       S-B                 t       P      F*

Upper & Middle    10.49   <.01   1.07      Upper & Middle     9.43   <.01   1.13
Upper & Lower     18.01   <.01   1.15      Upper & Lower     18.96   <.01   1002
Middle & Lower     8.15   <.01   1.24      Middle & Lower     8.83   <.01   1.06

WAIS      Mean      SD    Variance         S-B       Mean      SD    Variance

Upper    145.50    8.69     75.51          Upper    243.53   10.51    120.80
Middle   124.39    8.37     70.02          Middle   218.84   11.72    137.25
Lower    107.03    9.31     86.66          Lower    196.26   11.39    129.79

* None of the values of F, calculated for the groups involved in the comparison, are significant.


Both WAIS and S-B appeared to be valid as far as the present
study was concerned, and Wechsler's question as to a lack of validity
on the part of the S-B was not supported. Since both WAIS and
S-B were found to be valid, the second and the third hypotheses
were tested to find out whether the WAIS could be considered
relatively more valid than the S-B.

Test of the second and third hypotheses. Pearson product-mo- 
ment correlation coefficients were computed first by taking the 
three groups combined and then by taking the three groups sepa- 
rately. The results are reported in Table 3 along with the standard 
deviations and the ranges of the measures involved. When the cor-
relations were computed taking the three groups separately, the
high homogeneity of each group reduced the size of the correlation
coefficients. When, on the other hand, the correlations were com-
puted taking the three groups combined, the resulting heterogeneity
increased the size of the correlation coefficients. This heterogeneity
was obviously greater than the heterogeneity which would have
resulted from taking a random sample of the total population of
freshmen. In the present investigation, it will be recalled, the random
sampling was done from the lower, the middle, and the upper strata 
of that population, with the exclusion of the portions of freshmen 
population in between. 
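
The effect of homogeneity and heterogeneity on the size of a correlation coefficient can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of the study's data: the scores are synthetic, the population correlation of .60 is arbitrary, and the decile cutoffs only loosely imitate the upper, middle, and lower strata described above. All names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_of(sample):
    xs, ys = zip(*sample)
    return pearson_r(xs, ys)


rng = random.Random(0)          # fixed seed so the sketch is repeatable
rho = 0.6                       # assumed population correlation
pairs = []
for _ in range(20000):
    x = rng.gauss(0, 1)         # stand-in for the selection variable
    y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    pairs.append((x, y))

pairs.sort()                    # order cases by the selection variable
n = len(pairs)
lower = pairs[: n // 10]                          # lowest tenth
middle = pairs[int(0.45 * n): int(0.55 * n)]      # middle tenth
upper = pairs[-(n // 10):]                        # highest tenth

r_full = r_of(pairs)                        # whole population
r_upper = r_of(upper)                       # one homogeneous stratum
r_combined = r_of(lower + middle + upper)   # three strata pooled
print(round(r_upper, 2), round(r_full, 2), round(r_combined, 2))
```

In this sketch the coefficient within a single stratum falls below the population value, while the coefficient for the three pooled strata exceeds it, which is the pattern the text describes.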

The high correlation coefficient of .933 obtained between the two
criteria of selection, ACE and grades, deserves in particular a word
of caution. The magnitude of this coefficient was artificially pro-
duced by the fact that in selecting the subjects every effort was
made to match their ACE score with their grade-point average as
much as possible.

The influence of homogeneity has to be considered when com-
paring the results of the present investigation with the results of
previous studies. For the purpose of testing the significance of the
difference between the correlation coefficients, on the other hand,
the correlation coefficients can be directly compared since they have
been obtained from groups having comparable homogeneity.

With reference to the hypothesis of over-all higher validity of
the WAIS as compared with the S-B, the significance of the dif-
ference was tested between the correlation coefficient of the WAIS
with ACE and the correlation coefficient of the S-B with ACE of
the three groups combined. The same was done for the correlation
coefficients involving WAIS with grades and S-B with grades.


TABLE 3

Correlations Between WAIS IQ, Binet IQ, and ACE Scores for Three Groups
Separately and Combined, Plus Standard Deviations and Ranges

                      S-B       ACE       GRADES

WAIS   Combined      .897      .882       .841
       U             .395      .421       .485
       M             .535      .269      -.176
       L             .786      .255       [illegible]
S-B    Combined                .880       .835
       U                       .142       .301
       M                       .478       .029
       L                       .300       .044
ACE    Combined                           .933
       U                                  .402
       M                                  [illegible]
       L                                  .061

STANDARD DEVIATIONS AND RANGES

              WAIS              S-B               ACE             GRADES
Group       SD    Range      SD    Range       SD    Range      SD    Range

U          5.21  112-138    5.85  122-146     3.69   88-99      .34   2.80-4.00
M          5.01  100-124    6.56  111-139     2.09   47-53      .30   1.782218
L          5.57   90-113    6.35   94-122     3.22    1-12      .36   .40-1
Combined  10.81   90-138   12.79   94-146    35.70    1-99      .93   .40-4.00


With reference to the hypothesis of higher validity of the WAIS
as compared with the S-B at each of the three levels, the signifi-
cance of the difference was tested between the correlation coefficient
of WAIS with ACE, and the correlation coefficient of S-B with ACE
at the upper, at the middle, and at the lower levels. The same
operation was repeated for the correlation coefficients involving
WAIS with grades and S-B with grades. The above tests of
significance were done using the z transformation. None of the t's
obtained comparing the correlation coefficients either of the three 
groups combined or of the three groups separately were large 
enough to warrant rejection of the null hypothesis. The results of 
the tests of significance are reported in Table 4. 
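
A test of the difference between two correlation coefficients by way of the z transformation can be sketched as follows. This is an illustration only, assuming the common independent-samples form (each r converted with the inverse hyperbolic tangent, standard error based on N - 3 per coefficient) and a two-tailed normal probability. The source does not say which variant was applied, and since both coefficients here come from the same subjects a dependent-samples test would also be defensible, so the result need not reproduce the tabled values. The function name is hypothetical; the inputs are the upper-group correlations with ACE and the upper-group N reported in the article.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Critical ratio and two-tailed P for the difference between two r's.

    Assumption: the two coefficients are treated as if they came from
    independent samples of sizes n1 and n2.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher z transformation
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    ratio = (z1 - z2) / standard_error
    # Two-tailed probability under the unit normal curve.
    p_two_tailed = math.erfc(abs(ratio) / math.sqrt(2))
    return ratio, p_two_tailed


# Upper group (N = 38): WAIS with ACE (.421) versus S-B with ACE (.142).
ratio_upper, p_upper = fisher_z_difference(0.421, 38, 0.142, 38)
print(round(ratio_upper, 2), round(p_upper, 4))
```

With groups of fewer than 40 subjects the standard error of the transformed difference is large, so even a sizable gap between two coefficients can fail to reach significance.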

It appears that the WAIS cannot be considered more valid than 
the S-B, either in over-all validity or in validity at the three dif- 
ferent levels specified herein. 

TABLE 4

Tests of the Significance of Differences Between Selected Correlations

             WAIS, ACE    S-B, ACE
Group            r            r          t        P

Combined       .882         .880        .10     .9204
Upper          .421         .142       1.31     .1902
Middle         .269         .478       1.47     .1416
Lower          .255         .300        .51     .6100

             WAIS, GRADES  S-B, GRADES
Group            r            r          t        P

Combined       .841         .835        .29     .7718
Upper          .485         .301       1.1      .2584
Middle        -.176         .029       1.14     .2542

Test of the fourth hypothesis. The fourth hypothesis, i.e., that
the IQ means of the S-B are significantly higher than the IQ means
of the WAIS at each of the upper, middle and lower levels, involved
measures which are not directly comparable, i.e., IQ's yielded by
two different intelligence tests in accordance with somewhat dif-
ferent scoring procedures. However, the purpose of the hypothesis
under consideration was that of testing the significance of the
difference between the two IQ measures taken at face value. As
a consequence, the test of significance was carried out under the
assumption that the IQ's obtained with the two tests are scaling
the same or similar dimensions, and that therefore it would be
legitimate to apply the t test. Previous studies appear to have made
this assumption also (Anderson, 1942; Sartain, 1946).

TABLE 5

Comparison of WAIS and Binet IQ Scores for Three Groups

                 WAIS                    Stanford-Binet
Group       M      SD   Variance      M      SD   Variance      t       P      F*

Upper    123.11   5.21   27.15     135.24   5.86   34.29     12.08   <.01   1.26
Middle   110.47   5.01   25.14     121.64   6.56   43.06     11.50   <.01   1.71
Lower    100.06   5.57   31.03     107.94   6.36   40.40     11.58   <.01   1.30

* None of the values of F, calculated for the groups involved in each comparison, are significant.

The mean IQ of the S-B was 7.88 IQ points higher than the
mean IQ of the WAIS for the lower group, 11.17 points higher for
the middle group, and 12.13 points higher for the upper group. All
these differences proved to be significant beyond the one per cent 
level. Again F ratios for the group variances were not significant. 
The data are presented in Table 5. 
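
Because every subject took both tests, one way to test the difference between the two IQ means is the t test for correlated means. The sketch below assumes that form, with the variance of the differences rebuilt from the two standard deviations and the WAIS with S-B correlation, and with N - 1 in the denominator of the standard error; the article itself says only that the t test was applied, so the formula is an assumption and the function name is hypothetical. Means and standard deviations are taken from Table 5, the within-group correlations from Table 3, and the group sizes from Table 1.

```python
import math


def correlated_means_t(mean_a, sd_a, mean_b, sd_b, r_ab, n):
    """t for the difference between two means measured on the same subjects.

    Assumption: SD's were computed with N in the denominator, so the
    standard error of the mean difference uses N - 1.
    """
    var_diff = sd_a ** 2 + sd_b ** 2 - 2 * r_ab * sd_a * sd_b
    standard_error = math.sqrt(var_diff / (n - 1))
    return (mean_b - mean_a) / standard_error


# (WAIS mean, WAIS SD, S-B mean, S-B SD, r between WAIS and S-B, N)
t_upper = correlated_means_t(123.11, 5.21, 135.24, 5.86, 0.395, 38)
t_middle = correlated_means_t(110.47, 5.01, 121.64, 6.56, 0.535, 36)
t_lower = correlated_means_t(100.06, 5.57, 107.94, 6.36, 0.786, 35)
print(round(t_upper, 2), round(t_middle, 2), round(t_lower, 2))
```

Under these assumptions the three values fall within a few hundredths of the t's given in Table 5, which suggests, without proving, that a correlated-means test of this kind was used.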

The question of sex difference. The t test was applied to the 
difference between the means of the males and of the females for 
all three levels combined on both WAIS and S-B. For this test raw 
scores were used, i.e., weighted scores for the WAIS and sum of
months for the S-B. The WAIS raw score mean for the 58 males
was 127.55 (standard deviation, 16.99) and for the 51 females it
was 124.61 (standard deviation, 19.08). The value of t was .84
(P = .4010). The value of F (1.26) showed the variance of the
groups to be homogeneous. On the S-B the raw score mean for the
58 males was 219.20 (standard deviation, 24.87). The value of t
was .04 (P = .9680). The value of F (1.36) showed the variance of
the groups to be homogeneous. 

The difference between Verbal and Performance Scale of the 
WAIS. For all three levels combined, the IQ mean for the Verbal 
Scale was 111.32 (standard deviation, 10.49), and for the Perform-
ance Scale 110.33 (standard deviation, 11.70). The value of t was
1.14 (P = .2542). The value of F (1.24) showed the variance of
the groups to be homogeneous.

The correlation between the two scales was found to be .67; 
Wechsler reports .77 (Wechsler, 1955). 




Discussion 


Within the limits of the present study, the discrepancies in IQ
magnitude found between WAIS and S-B do not allow one to ques-
tion the validity of either test.

The fact that the WAIS takes into consideration the age factor,
while the S-B does not consider the age factor beyond age sixteen,
cannot explain the discrepancies in IQ between the two intelligence
tests as far as the subjects of the present investigation are con-
cerned. The mean age for these subjects was 18.24 for the upper
group, 18.33 for the middle group and 18.69 for the lower group.
The age of these subjects is below that at which the curve of
mental growth begins to fall off.

The discrepancies can be explained partly on the basis of the
different standard deviations (15 for the WAIS and 16.4 for the
S-B) yielded by the two tests in their standardization, and partly
on the basis of the differences in standardization samples.

The WAIS standardization sample is more representative of the 
general population as to age, race, urban-rural residence, and socio-
economic status than the standardization sample of the S-B. For
the WAIS those variables were controlled by including in the stand-
ardization sample white and non-white subjects, urban and rural
subjects, and subjects of different socio-economic status in the
ratios found in the 1950 census (Wechsler, 1955).

The S-B standardization sample, on the other hand, did not in-
clude non-white subjects at all, and was somewhat biased as far as
urban-rural residence and socio-economic status are concerned.

The discrepancies between WAIS and S-B IQ's found by the
present study are in line with those found between the S-B and
the Wechsler-Bellevue by previous studies. In the Anderson study
(1942) the S-B mean IQ was 10 points higher than the Wechsler-
Bellevue mean IQ, and in the Sartain study (1946) it was 12 points
higher. Both studies used college freshmen in the same age range
as those employed in the present investigation. The IQ means
were 118.5 on the Wechsler-Bellevue versus 128.3 on the S-B for
the Anderson study, and 117.48 on the Wechsler-Bellevue versus
129.48 on the S-B for the Sartain study.

By comparing the above means with the means found in the
present investigation (Table 5), it will be noticed that the subjects
of the above studies would correspond to the subjects placed close
to the upper group of the present investigation, where the dis-
crepancies are about 11-12 IQ points.

The expectation based on the results of the Goolishian and
Ramsay study (1956), that the discrepancies between WAIS and
S-B would be greater than those found between the S-B and the
Wechsler-Bellevue, seems not to be substantiated, at least when
college freshmen are used as subjects.


REFERENCES 


Altus, W. D. "The Relationship of Intelligence and Schooling When
Literacy is Held Constant." Journal of Consulting Psychology,
XIII (1949), 375-376.

Anastasi, A. Psychological Testing. New York: The Macmillan
Company, 1955.

Anderson, E. E. "Wilson College Studies in Psychology: I. A Com-
parison of the Wechsler-Bellevue, Revised Stanford-Binet, and
American Council on Education Tests at the College Level."
Journal of Psychology, XIV (1942), 317-326.

Burt, C. Mental and Scholastic Tests. London: P. S. King and Son,
Limited, 1927.

Garrett, H. E. "The Effect of Schooling Upon IQ; a Note on Lorge's
Article." Psychological Bulletin, XLIII (1946), 72-76.

Goolishian, H. A. and Ramsay, R. "The Wechsler-Bellevue, Form
I and the WAIS: a Comparison." Journal of Clinical Psychol-
ogy, XII (1956), 147-151.

Lorge, I. "Schooling Makes a Difference." Teachers College Rec-
ord, XLVI (1945), 483-492.

Mitchell, M. M. "Performance of Mental Hospital Patients on the
Wechsler-Bellevue and the Revised Stanford-Binet, Form L."
Journal of Educational Psychology, XXXIII (1942), 538-544.

Sartain, A. Q. "A Comparison of the New Revised Stanford-Binet,
the Bellevue Scale and Certain Tests of Intelligence." Journal
of Social Psychology, XXIII (1946), 237-239.

Wechsler, D. WAIS Manual. New York: The Psychological Cor-
poration, 1955.

Wellman, B. L. "Iowa Studies on the Effects of Schooling." Year-
book of the National Society for the Study of Education,
XXXIX, II (1940), 377-379.

Wellman, B. L. "IQ Changes of Preschool and Non-preschool Years;
a Summary of the Literature." Journal of Psychology,
(1945), 347-368.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 3, 1963 


PATTERNS OF NEEDS AS PREDICTORS OF 
CLASSROOM BEHAVIOR OF TEACHERS 


KENNETH H. WODTKE, IAN E. REID, NORMAN E. WALLEN,
AND ROBERT M. W. TRAVERS


University of Utah 


THE question of the relative efficacy of prediction based on pat-
tern scores as compared to 'status' scores is not new (Gaier & Lee,
1953; Osgood & Suci, 1952) but has particular relevance to attempts 
to predict behavior based on measures of motivation. In an earlier 
report, Travers, et al., (1961) demonstrated that the classroom be- 
havior of elementary school teachers could be predicted from a 
questionnaire measure of affiliation and control needs, the Stern 
and Masling Teacher Preference Schedule. In this earlier study and 
in the present investigation, the behavior to be predicted, unlike 
that reported in many studies, is naturally occurring behavior, 
namely, behavior in the classroom. Classroom situations provide 
numerous cues capable of arousing affiliation, control, achievement, 
and other needs, and hence might be expected to lead to behavior 
related to the satisfaction of each of these needs. An important 
point to note is that the problem involves prediction in a situation 
in which several needs may each be aroused, a fact which may com- 
plicate considerably the prediction problem. Under these conditions 
two main approaches to prediction are possible. These may be 
described as follows: 

1. The hypothesis may be set up that the amount of teacher 
behavior related to the satisfaction of a given need is related to 
the strength of that need. This follows the typical paradigm used in 
most prediction studies, and was the hypothesis tested in our earlier 
report (Travers, et al., 1961). 




2. An alternative hypothesis is that the amount of behavior
emitted related to a particular need depends, not only on the abso-
lute strength of the need, but also on the relative strength of that
need with respect to other needs. Thus, according to this hypothesis,
a teacher with a high control need might show little behavior related
to this need if the teacher were characterized by an even stronger
need for affiliation. This hypothesis assumes that the relative strength
of needs determines the behaviors that are manifested.


The problem investigated here is to determine whether need- 
related teacher behavior can be predicted from a pattern analysis 
of need scores and, if prediction is possible, whether a pattern 
analysis of need scores improves prediction over predictions made
only from independent sets of scores.


Method 


The subjects used in the study were 118 women elementary school 
teachers from five different schools in Salt Lake City, and two 
schools in a nearby community. The schools were selected to give a 
representative sample of elementary schools in the area. The range 
of grades was kindergarten through grade seven; however, the 
majority of the sample came from grades one through six. 

The measure of teacher needs used to predict classroom behavior 
was the Teacher Preference Schedule (hereafter referred to as TPS) 
developed by Stern and Masling (1958). 

New scoring keys for the TPS were developed so that it would pro- 
vide four scores for achievement, affiliation, recognition, and control 
needs. The recognition scores were deleted from the analysis owing 
to the difficulty of obtaining adequate measures of classroom recog- 
nition behavior, and to simplify the pattern analysis. The TPS 
was administered in a battery with several other tests. The teachers 
were allowed to take the tests home and complete them in their 
spare time. 

The measures of the teachers’ classroom behavior were obtained 
separately by two classroom observers working independently. Each 
observer visited the classroom of each teacher for a total of at least 
two hours scattered over several visits. The visits were distributed
over different times of the day and days of the week so that an ade-
quate sample of teacher behavior could be obtained. During their
visits to the classrooms, each observer recorded fifty statements
made by each teacher using a time sampling technique developed by 
Withall (1949). These statements were later classified into categories 
for achievement, affiliation, control, and management oriented state- 
ments. A teacher's score was simply the total number of her recorded 
statements classified in each category. In addition, ratings were 
made by the observer of various aspects of teacher behavior. 
After all teachers had been visited, the observers made Q-sorts of 
the teachers with respect to twenty-four dimensions of classroom 
behavior. Factor analysis of these ratings derived from the Q-sorts 
produced five factors, and factor scores were then obtained for each 
teacher by adding the scores on the Q-sort dimensions which had the 
highest loadings on each factor. The factors are defined in terms of 
the high end of the Q-sort dimensions as follows: Factor 1, warm, 
permissive; Factor 2, quiet, controlled, dull; Factor 3, high ego- 
strength, not frustrated, confident; Factor 4, spends little time work- 
ing alone; Factor 5, not concerned with learning or achievement. 
The three need scores on the TPS were converted to z-scores so
that they would be comparable for the pattern analysis. Since the
distributions of these scores were already approximately normal, it
was not necessary to normalize.
The data provided a profile of three need scores for each subject.
The profiles were analyzed in two ways. First, need pattern groups
were selected from the TPS profiles. In order for a teacher to be
selected for a particular need pattern group one of her need scores
had to fall at least one standard deviation above or below her two 
other need scores, regardless of the absolute height of her profile. In 
this way a high and low need group was obtained for each need. 
The high group for a particular need consisted of subjects whose 
scores on that need fell at least one standard deviation above the 
other two needs; the low group consisted of subjects whose scores 
on that need fell at least one standard deviation below the other 
needs. The number of subjects which met this criterion for at least 
one need ranged from 9 to 17 out of the total of 118 subjects. The 
means of these groups on the classroom behavior measures were then 
compared, and t-tests computed. 
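
The selection rule for the need pattern groups can be sketched in a few lines. This is a minimal illustration, assuming each profile is held as a mapping from need name to z-score and that "at least one standard deviation" means a difference of 1.0 or more in z-score units; the function name, the dictionary keys, and the example profile are hypothetical.

```python
def pattern_group(z_scores, need, margin=1.0):
    """Classify one teacher's profile for a given need.

    Returns "high" if the need's z-score is at least `margin` above both
    other needs, "low" if it is at least `margin` below both, else None.
    The absolute height of the profile is ignored, as in the text.
    """
    own = z_scores[need]
    others = [value for name, value in z_scores.items() if name != need]
    if all(own - other >= margin for other in others):
        return "high"
    if all(other - own >= margin for other in others):
        return "low"
    return None


# Hypothetical profile: affiliation stands well above the other two needs.
teacher = {"ach": -0.2, "aff": 1.4, "con": 0.1}
groups = {need: pattern_group(teacher, need) for need in teacher}
print(groups)
```

A rule of this kind leaves most profiles unclassified, which is why only a small fraction of the 118 teachers fell into any one pattern group.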
One of the problems with a pattern analysis is that extremely 
large samples are necessary in order to obtain large enough groups 
which meet the criterion for a particular pattern. For this reason a
second analysis of the TPS profiles was made. In order to increase
the sample size so that the total sample of 118 teachers could be
used, a pattern score was developed which could be assigned to
every subject. A correlational analysis was then used and this per-
mitted a direct comparison of the pattern predictions with the pre-
dictions using independent scores. The procedure for assigning
pattern scores was essentially the same as the procedure for differ-
entiating the pattern groups, except that instead of a simple dicho-
tomy of high and low patterns, a seven-point scale (0 through 6)
was used and a pattern score was assigned for each need depending
upon the extent to which a need score differed from the other two
need scores. The pattern scores were assigned as follows:


0: when a need score was at least 1.7 S.D. below each of the other
two needs;
1: when a need score was between 1.0 and 1.7 S.D. below each of the
other two needs;
2: when a need score was between .3 and 1.0 S.D. below each of the
other two needs;
3: when a need score was either the middle score of the profile, or
did not differ from each of the other two needs by more than ±.3
S.D.;
4: when a need score was between .3 and 1.0 S.D. above each of the
other two needs;
5: when a need score was between 1.0 and 1.7 S.D. above each of
the other two needs;
6: when a need score was at least 1.7 S.D. above each of the other
two needs.
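
The scoring rules listed above can be written as a single function. The sketch below is an interpretation rather than the authors' procedure: it assumes that "above each of the other two needs" is decided by the smaller of the two differences, that boundary values (exactly .3, 1.0, or 1.7) go to the more extreme category, and that every profile not covered by the other rules receives the middle score of 3. The function name and example values are hypothetical.

```python
def pattern_score(need_z, other_a, other_b):
    """Assign a 0-6 pattern score to one need from three z-scores."""
    diffs = (need_z - other_a, need_z - other_b)
    smallest = min(diffs)   # margin above the nearer need (if positive)
    largest = max(diffs)    # margin below the nearer need (if negative)

    # Need score lies above BOTH other needs.
    if smallest >= 1.7:
        return 6
    if smallest >= 1.0:
        return 5
    if smallest >= 0.3:
        return 4
    # Need score lies below BOTH other needs.
    if largest <= -1.7:
        return 0
    if largest <= -1.0:
        return 1
    if largest <= -0.3:
        return 2
    # Middle score of the profile, or within .3 S.D. of the others.
    return 3


print(pattern_score(2.0, 0.0, 0.1))   # far above both other needs
print(pattern_score(0.0, 1.2, 1.5))   # between 1.0 and 1.7 below both
print(pattern_score(0.5, 0.0, 1.0))   # middle score of the profile
```

Unlike the dichotomous grouping, this rule yields a score for every teacher, which is what allows the full sample to enter the correlational analysis.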




Results 


Both the t-test and correlational analyses indicated that the TPS
predicted some aspects of classroom behavior.

The results of comparing extreme patterns are shown in Table 1.
When the high and low need pattern groups were compared on the
various behavioral measures, a number of significant differences
were obtained for the affiliation and control variables. No consistent
predictions were obtained with the achievement scale, although sev-
eral approached the 5 per cent level of significance. The analysis
presented in Table 1 shows that the high affiliation pattern group on
the TPS was more warm and permissive, more quiet, controlled, and
dull, less concerned with achievement, and was rated more kindly,
more affiliative, and less controlling than the low affiliation pattern
group. Most of the differences were in the opposite direction for the
comparisons between the high and low control pattern groups. The
high control group was less warm and permissive, gave fewer state-
ments classified as affiliative, was rated as less kindly, less affilia-
tive, and more controlling than the low control group. These com-
parisons indicated that the control and affiliation variables are
negatively correlated. The TPS pattern scores for affiliation and
control correlated -.39, the affiliation and control ratings -.50,
and the affiliation and control verbal statements -.25.

TABLE 1

Mean Criterion Scores and t-tests for High and Low Need Pattern
Groups on the Teacher Preference Schedule(a)

                                          Teacher Preference Schedule
Measures of                      Low    High            Low    High            Low    High
Classroom Behavior               Ach    Ach      t      Aff    Aff      t      Con    Con      t

Factors:
1. Warm, permissive             17.75  23.25   1.57    20.06  28.65   2.59*   28.24  17.35   4.12**
2. Quiet, controlled, dull      17.08  17.17           15.00  21.12   4.03**  19.76  18.18
3. High ego strength, not
   frustrated, confident        10.42  10.17           10.94  13.12   1.61    13.59  10.53   1.96
4. Spends little time
   working alone                 4.00   4.00            3.71   4.65   1.71     4.35   3.24   1.82
5. Not concerned with
   learning or achievement      12.25  16.07   1.99    13.12  17.65   2.92**  15.59  13.35   1.29

Teacher Statements:
   Achievement                  33.42  40.50   1.61    41.65  39.76           35.00  38.47
   Affiliation                   1.58   3.67   1.99     2.88   1.82   1.26     2.35   1.06   2.19*
   Control                      28.00  21.42   1.90    26.41  21.41   1.66    21.71  26.29   1.64
   Management                   22.50  19.92           19.53  22.82   1.38    24.29  22.76

Ratings:
   Kindly                        7.92   8.42            8.12  11.06   3.06**  10.94   7.88   3.73**
   Stimulating                   8.33   8.25            9.59   8.65   1.21     8.76   7.82   1.00
   Achievement                   9.92   9.33            9.82   8.53   1.84     9.94  10.41
   Affiliation                   6.58   7.42            6.35   8.71   2.91**   8.82   5.71   4.04**
   Control                      10.17   8.75   1.37    10.12   6.94   3.66**   7.53  10.59   3.87

df                                        22                     32                     32

(a) t values less than 1.00 were omitted from the table.

* Significant at the .05 level.

** Significant at the .01 level.

The correlational analysis of the TPS pattern and conventional
scores yielded results quite similar to the comparisons of need pat-
tern groups. As can be seen in Table 2, the pattern and non-pattern
scores of achievement motivation again failed to predict the be-
havioral measures. The affiliation scale correlated positively with
the kindly and affiliative ratings and negatively with control ratings,
while the control scale correlated positively with the behavioral
measures of control and negatively with affiliation.
A comparison of the correlations in Table 2 also shows that the
pattern score gave no improvement to the predictions. The correla-
tions increased slightly when the pattern score was used for predict-
ing affiliative behavior; however, the conventional non-pattern
score actually predicted slightly better for the control variables.
Unfortunately, the extent to which the need pattern scores could


TABLE 2

Correlations Between the TPS Conventional and Pattern Need Scores
and the Teacher Classroom Behavior Measures*

                                    Teacher Preference Schedule
Measures of                    Ach                 Aff                 Con
Classroom Behavior     Conventional Pattern  Conventional Pattern  Conventional Pattern
                          Score     Score       Score     Score       Score     Score

Factors: 
1. mum. permis- 
уе — .03 .09 .17 .25 —.46 TE 
2. ud! con- 
dull —.00 —.01 14 .28 —,14 EM 
3. High ego 
strength, not 
frustrated, 
confident —.10 01 - -. eem 
4. Spends little 4 N x 
bes working 
lone 00 Р - "7 
5. Not concerned w ж a a 
with learning 
or achievement ‚05 04 = (7 
bred Statements: zi A a 
chievement .09 .05 —.04 —.07 .08 
Affiliation .06 .05 .08  —.01 -.12 
оа : 06-12 .04  —.13 .82 
anagemen = 13) vie: - = 
Ratings. 03 .02 .10 14 
Kindly —.00 .04 20 —.37 
Stimulating = (05 8 ови 16-0 
Achievement —.08  —.02 —.20 —.18 .10 
Affiliation 03 ` 05 COAT UNE 80 
Control 03 ~.08)" —.98 —.36 :45 


*N = 118, r.05 = .18




improve prediction was limited by the high correlations between
the pattern scores and the independently derived need scores. The
pattern scores and the independent need scores correlated .61, .65,
and .72 for the TPS achievement, affiliation, and control scales, respectively.
In order to determine whether the need pattern scores
were contributing anything unique to the predictions of classroom
behavior beyond that being contributed by the independently derived
scores, partial correlations were computed between the TPS
affiliative and control pattern scores and the classroom behavior
measures holding the independent scores constant. The correlations
are reported in Table 3 for the behavioral measures with which
some relationships were expected. The partial correlations indicate
that the pattern score added nothing to the predictions based upon
the TPS control scale, but that the pattern score did make a unique
contribution to the predictions based upon the TPS affiliation scale.
The correlations are admittedly quite small but, nevertheless, the
finding is a puzzling one.
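
The partial correlations described above can be illustrated with the standard first-order formula, r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The following Python sketch is offered only as an illustration and is not the authors' computation. The function name is hypothetical, and the inputs are the affiliation correlations as they can be read from Table 2, together with the correlation of .65 between the affiliation pattern score and the independent affiliation score.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# x = behavior measure, y = TPS affiliation pattern score,
# z = TPS affiliation conventional (independent) score.
# Values as read from Table 2; r_yz = .65 is reported in the text.
warm_permissive = partial_r(r_xy=0.25, r_xz=0.17, r_yz=0.65)
quiet_controlled = partial_r(r_xy=0.28, r_xz=0.14, r_yz=0.65)

print(round(warm_permissive, 2), round(quiet_controlled, 2))
```

Under these readings the two results round to .19 and .25, which agree with the first two affiliation entries of Table 3.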

Odd-even, split-half reliability coefficients were computed for the
affiliation and control scales of the TPS using the Spearman-Brown
formula. Although the TPS affiliation and control scales contain
only 15 and 19 items, respectively, the obtained reliabilities were .64
for affiliation and .80 for control using the total N of 118 teachers.
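
The odd-even procedure can be sketched as follows: each respondent's odd-numbered and even-numbered items are summed separately, the two half scores are correlated, and the Spearman-Brown formula, r_full = 2 r_half / (1 + r_half), steps the result up to full length. The Python below is a minimal sketch under those assumptions. The function names and the small item matrix are hypothetical and are not the TPS data.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to the full-length reliability."""
    return 2 * r_half / (1 + r_half)


def odd_even_reliability(item_scores):
    """item_scores: one list of item scores per respondent."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd, even))


# Hypothetical responses of three people to four items.
demo = [[1, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
print(odd_even_reliability(demo))
```

Read in reverse, a stepped-up coefficient of .80 corresponds to a correlation of about .67 between the two halves.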


Measures of
Classroom Behavior                     Teacher Preference Schedule
Factors:                                    Aff        Con
1. Warm, permissive .19 —.05 
2. Quiet, controlled, dull .25 
3. High ego strength, not frustrated, 
confident .18 .06 
4. Spends little time working alone —.09 
5. Not concerned with learning or 
achievement .07 
Teacher Statements: 
Аў Control —.21 —.09 
Ratings:
Kindly 15 IT 
Achievement —.07 10 
Affiliation 11 —.16 
Control -.28 .05


*N = 118, r.05 = .18




The results indicate that no great advantage is provided in the 
situation studied by the use of the need pattern score. Generally, 
the independent need score predicted as well as the pattern score 
with the exception of the TPS affiliation scale. The above conclu- 
sion, however, should be qualified in several respects. First, the 
pattern scores and the independently derived scores were correlated 
and this placed a definite limitation on any improvements in predic- 
tion which could be expected from the pattern score. Secondly, 
despite these correlations between the two scores there is a tentative 
indication that the TPS affiliation pattern score did make a slight, 
but unique contribution to the prediction of teacher behavior. This 
finding suggests that the need to control may determine behavior 
independently of other needs, while the occurrence of affiliative be- 
havior may depend upon both the absolute strength of the affiliation 
need, and its strength relative to the other needs in an individual’s 
profile. For the purposes of predicting the behavior related to some
needs, a multiple prediction utilizing information concerning both
the absolute strength and the relative strength of the need may produce
the best prediction.


Summary 


An attempt was made in this investigation to predict the class- 
room behavior of 118 female elementary school teachers from an 
instrument designed to measure their needs (i.e., the Stern and
Masling Teacher Preference Schedule). The TPS was scored for 
three needs: achievement, affiliation, and control. In addition to 
the conventional independent need scores, a pattern analysis of 
the profiles of needs was performed to test the hypothesis that the 
relative strength of needs determined behavior in conjunction with 
the absolute strength of the needs.

The criterion measures of teachers’ classroom behavior were ob- 
tained by two observers. Each observer collected fifty verbal state- 
ments from each teacher using a time sampling technique developed 
by Withall (1949). These statements were later classified into cate- 
gories for achievement, affiliation, control, and management oriented 
statements. The observers also conducted Q-sorts on the teacher 
sample with respect to 24 dimensions. These data were factor- 
analyzed providing five factor scores. The teachers were also rated 
by the observers on five other dimensions: kindly, stimulating,
achievement motivated, affiliation motivated, and control motivated.

In general the conventional independently derived need scores
predicted as well as or better than the pattern scores, with the exception
of the TPS affiliation scale, where there is an indication that
the pattern score contributed to the predictions. This finding sug- 
gests that the relative strength of needs may determine the be- 
havior emitted to satisfy some needs, but that other needs may 
function independently. 


REFERENCES 


Gaier, E. L. and Lee, Marilyn C. “Pattern Analysis: The Configural
Approach to Predictive Measurement.” Psychological Bulletin,
L (1953), 141-149.

Osgood, C. E. and Suci, G. J. “A Measure of Relation Determined
by Both Mean Difference and Profile Information.” Psychological
Bulletin, XLIX (1952), 251-262.

Stern, G. G. and Masling, J. M. “Unconscious Factors in Career
Motivation for Teaching.” Final Report, U. S. Department of
Health, Education, and Welfare, Office of Education Contract
No. SAE 6459, 1958.

Travers, R. M. W., Wallen, N. E., Reid, I. E., and Wodtke, K. H.
“Measured Needs of Teachers and Their Behavior in the Classroom.”
Final Report, U. S. Department of Health, Education,
and Welfare, Office of Education Contract No. 444 (8029), 1961.

Withall, J. G. “The Development of a Technique for the Measurement
of Social-Emotional Climate in the Classroom.” Journal of
Experimental Education, XVII (1949), 347-361.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 3, 1963


ELECTRONIC COMPUTER PROGRAMS AND 
ACCOUNTING MACHINE PROCEDURES 


Edited by 
WILLIAM B. MICHAEL 
University of California, Santa Barbara 


Scoring Teacher-Made Tests with the IBM 1620. CURT
  STAFFORD AND JOHN BIANCHINI
A Comparison of Two Computer-Based Procedures of Orthogonal
  Analytic Rotation with a Graphical Method when a
  General Factor Is Present. MARY L. TENOPYR AND WILLIAM
  B. MICHAEL
The Wherry-Doolittle Test Selection Method in Fortran.
  NATHAN JASPEN


In view of the tremendous advances that have been made in the
adaptation of electronic computers and accounting machines to the 
processing of statistical data, sections of the Spring and Autumn is- 
sues of EDUCATIONAL AND PSYCHOLOGICAL MEASURE- 
MENT are devoted to the publication of such programs as are 
appropriate to psychometric procedures. Programs relevant to such
problem areas as factor analysis, item analysis, multiple regression 
procedures, the estimation of the reliability and validity of tests, 
pattern and profile analysis, the analysis of variance and co- 
variance, discriminant analysis, and test scoring will be consid- 
ered. Customarily a program should be expected not to exceed 
six or eight printed pages. Manuscripts of four or fewer printed 
pages are preferred. Each manuscript will be carefully reviewed as 
to its suitability and accuracy of content. In some instances an accepted
paper may be returned to the author for possible revisions
or shortening. The cost to the author will be fifteen dollars per page 
for regular running text. The extra cost of the composition of tables 
and formulas will be added to the basic rate. Manuscripts received 
up to November first will be considered for the Spring issue; manu- 
scripts received between then and May first will be considered for 
the Autumn issue. 


All correspondence should be directed to 
William B. Michael
Professor of Education and Psychology 
University of California, Santa Barbara 
University, California 


SCORING TEACHER-MADE TESTS WITH THE IBM 1620 


CURT STAFFORD 


San Jose State College 
AND 
JOHN BIANCHINI 


University of Illinois 


Introduction 


At San Jose State College the Testing Office has begun to use the
IBM 1620 to score teacher-made tests. Instructors obtain a deck of
punched, specially-printed mark sense answer cards from the
Testing Office. Students record answers with an electrographic pencil,
and, after mark sense punching, the answer cards become direct
input to the computer. The program, which is quite general, is of
sufficient capacity to accommodate a wide range of tests.

The information provided by the program is as follows: (1) for
each student: student identification number, the number of items
right, the number of items wrong, the sum of rights plus wrongs, and
the raw score as per scoring formula; (2) for the set of scores:
number of tests, sum of raw scores, sum of squares of raw scores,
mean, standard deviation, Kuder-Richardson formula 21 reliability
estimate, standard error of measurement, and a frequency distribution
of raw scores and of T-score conversions; (3) for each test item:
a choice analysis consisting of frequency and percentage of students
choosing an alternative. A test may contain as many as 450
items.

The greatest limitation is that an item may have only one correct
answer because the computer on campus will not read double
punches.

Series of true-false items, or the multiple-choice alternatives must 
be given an item number rather than a letter. 




The writing of the test scoring program was a relatively simple
task. Test scoring offered as a service to faculty, however, is
a complex task, one which demands the establishment of a system
which goes far beyond the operation of the computer. This article
will touch upon broader aspects of the test scoring service as well as
upon a description of the program.


Input 


The computer recognizes three different kinds of cards in the scoring
operation: (1) Answer Cards, (2) Key Cards, and (3) Head
Card.

Answer Cards. As indicated previously, the Answer Card is of
special design. Basically it is a full 27 position mark sense card
printed front and back to permit recording answers to 54 items on
a single card. Only five marking positions per item are provided;
they are designated as A, B, C, D, and E. Choice A corresponds to
1 time; choice B to 2 time, etc. Item number 1 is printed in MS column
27; item number 2 is printed in MS column 26, etc. Although the
item numbers are in reverse order of the MS column numbers, the
wiring puts the punch for item 1 in punch column 1, etc., to permit
easier sight checking.

Space for the student to print his name and other identifying information
is provided on both sides of the card just beneath the
area which senses the 9 punch. Writing with electrographic pencil
beyond the bounds of the space provided has been the major cause
of double punching.

Additional cards accommodate items 55-108 (card number 2),
items 109-162 (card number 3), etc.


Answer Cards are punched as follows:

Columns 1-54     Answers to items 1-54 (or 55-108, etc.)
Column 59        “0” (zero)
Column 60        Card number, a digit from 1-9
Columns 61-65    Student identification number


Key Cards. Key Cards are Answer Cards with different control
punches. The Key Card layout is:

Columns 1-54     Answers to items 1-54 (or 55-108, etc.)
Columns 59-61    “KEY”
Column 62        “0” (zero)
Column 63        Card number, a digit from 1-9
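
The two card layouts can be illustrated by a short routine that reads an 80-column card image by fixed columns. This is a sketch in Python and not the 1620 program itself. The function name is hypothetical, and two points are assumptions: that a blank column is an unanswered item, and that card number n carries items 54(n - 1) + 1 through 54n.

```python
ITEMS_PER_CARD = 54  # items recorded on one mark sense card


def read_card(card):
    """Return (kind, card_number, student_id, answers) for an Answer or Key Card.

    answers maps item number to the punched character; blank columns are
    skipped (assumed here to be omissions).
    """
    card = card.ljust(80)
    if card[58:61] == "KEY" and card[61] == "0":   # columns 59-61 and 62
        kind, card_no, student = "key", int(card[62]), None
    elif card[58] == "0":                          # column 59
        kind, card_no, student = "answer", int(card[59]), card[60:65]
    else:
        raise ValueError("unrecognized card")
    first_item = (card_no - 1) * ITEMS_PER_CARD + 1
    answers = {first_item + i: ch for i, ch in enumerate(card[:54]) if ch != " "}
    return kind, card_no, student, answers


# A hypothetical second answer card for student 00123 with five items marked.
example = "12345".ljust(58) + "0" + "2" + "00123"
print(read_card(example))
```

Column numbers in the layouts are counted from 1, so column 59 is index 58 in the card image.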


Head Card. The Head Card details the scoring formula and provides
certain other control and identifying information. Provision
has been made for an extremely wide range of scoring formulas. The
instructor may use any formula of the form sR - tW, where s is a
single digit from 1 through 9, and where t is a value from 0.000
through 9.999. R is the number of correct responses and W is the
number of wrong responses. Thus the simple “number right” formula
would be entered as 10000 and “rights minus one-half wrongs” would
be entered as 10500.
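
The five-digit formula entry can be decoded as one digit for s followed by four digits for t with three implied decimal places. The Python sketch below shows that reading of the two examples just given. The function names are hypothetical.

```python
def parse_formula(field):
    """Split a five-digit entry such as '10500' into (s, t)."""
    if len(field) != 5 or not field.isdigit():
        raise ValueError("formula entry must be five digits")
    s = int(field[0])            # weight for rights, 1 through 9
    t = int(field[1:]) / 1000    # weight for wrongs, 0.000 through 9.999
    if s < 1:
        raise ValueError("s must be 1 through 9")
    return s, t


def raw_score(rights, wrongs, field):
    """Raw score sR - tW for the given formula entry."""
    s, t = parse_formula(field)
    return s * rights - t * wrongs


print(parse_formula("10000"))      # number right
print(parse_formula("10500"))      # rights minus one-half wrongs
print(raw_score(40, 10, "10500"))  # 40 - 0.5 * 10
```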


The Head Card layout is:

Columns 1-4      “HEAD”
Column 5         (Blank)
Column 6         s from scoring formula
Columns 7-10     t from scoring formula
Column 11        Number of answer cards per student
Columns 12-14    Number of test items, including omissions in key
Columns 15-17    Maximum possible score (taking scoring formula into account)
Columns 18-75    Instructor identification

Information in the Head Card beyond column 17 is available for 
whatever identification the instructor wishes to enter. The Testing 
Office enters information which it desires to have for its annual 
statistical report. The Head Card information, which is carried
through the computer, is punched out exactly as it is read in. 
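
As an illustration of the Head Card layout, the Python sketch below reads the fields by column position. The class and function names are hypothetical. The card image is taken from the first line of the sample printed report in the Output section, on the assumption that the tabulator prints the Head Card columns as they were punched.

```python
from typing import NamedTuple


class HeadCard(NamedTuple):
    s: int                  # column 6
    t: float                # columns 7-10, three implied decimals
    cards_per_student: int  # column 11
    n_items: int            # columns 12-14
    max_score: int          # columns 15-17
    identification: str     # columns 18-75


def read_head_card(card):
    """Decode an 80-column Head Card image by fixed columns."""
    card = card.ljust(80)
    if card[0:4] != "HEAD":
        raise ValueError("not a Head Card")
    return HeadCard(
        s=int(card[5]),
        t=int(card[6:10]) / 1000,
        cards_per_student=int(card[10]),
        n_items=int(card[11:14]),
        max_score=int(card[14:17]),
        identification=card[17:75].strip(),
    )


head = read_head_card("HEAD 100002108108 SAMPLE PRINTED REPORT")
print(head)
```

Read this way, the sample describes a number-right formula (s = 1, t = 0), two answer cards per student, 108 items, and a maximum score of 108.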


Output 


A printed report is prepared for the instructor by running the
computer output through a tabulator. An 80/80 panel will produce
a readable report, but a specially-wired panel is used in the scoring
service to make easier reading for faculty members. A sample
printed report follows:




HEAD 100002108108 SAMPLE PRINTED REPORT 


STUDENT RAW NR NW R+W 
123 43 43 65 108 


125 23 23 84 107 
130 98 98 10 108 
NO. TESTS SUM RAW SUM SQ. 

8 511 40399 
MEAN SIGMA SEM KR21 
63.9 31.1 4.2 .98 


RAW SCORE FREQ T SCORE 
98 1 61 


23 1 37 


QUEST. FREQUENCY PER CENT
A B C D E A B C D E
1 8* 100*
2 1 6* 1 13 75* 13
3 1 6* 13 75*


(Note: * marks the keyed answer) 


Omissions are not tallied within the computer. They must be
determined by noting where the total number of choices is not equal
to the number of students tested (or 100%). Question No. 3 is an
example of this.
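
The set statistics in the sample report can be recomputed from its three totals (8 tests, sum of raw scores 511, sum of squares 40399) and the 108 items shown on the Head Card line. The Python sketch below is an illustration and not the 1620 routine. The function names are hypothetical, and the use of N rather than N - 1 as the variance divisor is an assumption, chosen because it reproduces the printed figures.

```python
import math


def set_statistics(n_tests, sum_raw, sum_sq, n_items):
    """Mean, sigma, KR-21, and SEM from the totals carried by the program."""
    mean = sum_raw / n_tests
    variance = sum_sq / n_tests - mean ** 2          # divisor N (assumed)
    sigma = math.sqrt(variance)
    kr21 = (n_items / (n_items - 1)) * (
        1 - mean * (n_items - mean) / (n_items * variance)
    )
    sem = sigma * math.sqrt(1 - kr21)                # standard error of measurement
    return mean, sigma, kr21, sem


def t_score(raw, mean, sigma):
    """T-score conversion with mean 50 and standard deviation 10."""
    return 50 + 10 * (raw - mean) / sigma


mean, sigma, kr21, sem = set_statistics(8, 511, 40399, 108)
print(round(mean, 1), round(sigma, 1), round(sem, 1), round(kr21, 2))
print(round(t_score(98, mean, sigma)), round(t_score(23, mean, sigma)))
```

These inputs give 63.9, 31.1, 4.2, and .98, and T scores of 61 and 37 for raw scores of 98 and 23, in agreement with the sample report.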


Material Requirements 


Four pieces of equipment are required to complete the processing: 
(1) 27 position mark sense reproducer, preferably equipped with 
double punch detection, (2) sorter, (3) IBM 1620 with 40,000 
character storage capacity, and (4) tabulator. Two reproducer 
panels are required, one for mark sense punching each side of the 
answer card. A specially-wired tabulator panel, or an 80/80 panel, 
is required for printing results. 


Procedure 


The key to the entire operation is the proper marking of answer 
cards by the students. Vigorous supervision by the instructor during 
testing thus is of paramount importance. Machine processing begins 
with the mark sense punching of the key and answer cards. It is 
highly desirable that double punched cards, and missed punches, be 
detected at this step. Answer cards are sorted in order that they are 
in consecutive order for each student; student serial order is a convenience
to the instructor, but not a requirement of the program.
The head card, key cards in order, and answer cards in order follow
the program. As each student's answer cards are scored, a summary
card is punched out. The set statistics and choice analysis are
punched out after the last test is scored. The output then is printed
on a tabulator.


Program Details 


The test scoring program consists of four major routines: (1) 
input, (2) scoring, (3) choice analysis, and (4) set statistics. One
operation cannot be executed apart from the others. 

The complete program contains 630 instructions and the com- 
pressed deck consists of 194 cards which can be loaded in about 50 
seconds. The time required to score a set of tests primarily is de- 
pendent upon the number of students and the number of answer 
cards per student. The scoring of a 50-item test (one answer card) 
is accomplished at the rate of about 85 students per minute. The 
scoring of a 200-item test (4 answer cards) is accomplished at about 
16 students per minute. Calculations required for set statistics are 
not demanding, and the punching out of the frequency distribution 
and the choice analysis is done at about 100 cards per minute. 

Right and wrong responses are counted by branching to the 
“rights” counter on an equal condition and to the “wrongs” counter 
on a non-equal condition. The program actually will score any 
character which the computer will read, counting it as correct if it 
matches the key, and wrong if it does not. Since the choice analysis 
area provides only for storing “1” through “5” punches, the maxi- 
mum number of choices per item normally is considered to be five. 
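
The counting logic just described, together with the choice analysis, can be sketched in Python as follows. This is an illustration only and not the 1620 code. The function names and the three answer strings are hypothetical. Skipping blank columns is an assumption, suggested by the sample report, in which one student's rights plus wrongs total 107 on a 108-item test. Percentages are rounded half up, as the 13 per cent printed for 1 student in 8 suggests.

```python
def score(key, answers):
    """Count rights and wrongs by comparing each column with the key."""
    rights = wrongs = 0
    for k, a in zip(key, answers):
        if k == " " or a == " ":   # omitted in key or by student (assumed skipped)
            continue
        if a == k:
            rights += 1            # equal condition
        else:
            wrongs += 1            # non-equal condition
    return rights, wrongs


def choice_analysis(answer_strings, n_items):
    """Frequency and per cent of students choosing each of choices 1-5."""
    freq = [[0] * 5 for _ in range(n_items)]
    for answers in answer_strings:
        for i, a in enumerate(answers[:n_items]):
            if a in ("1", "2", "3", "4", "5"):
                freq[i][int(a) - 1] += 1
    n = len(answer_strings)
    pct = [[int(100 * f / n + 0.5) for f in row] for row in freq]
    return freq, pct


key = "13524"
students = ["13524", "13 21", "23524"]   # the second student omits item 3
print([score(key, s) for s in students])
freq, pct = choice_analysis(students, len(key))
print(freq[2], pct[0])
```

For item 3 the frequencies sum to 2 rather than 3, which is how an omission shows up in the choice analysis.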


Test Scoring as a Service to Faculty 


The test scoring system at San Jose State College began its opera- 
tional test in the fall semester, 1962. Early experience showed that 
the weak link of the system, as expected, was the mark sense punching
operation. Students had a very positive reaction to the special
answer card, but they were not handling it as carefully as the system
demanded. Extraneous marks were picked up to cause double
punching; about 1 legitimate mark in 1000 was not being picked up.
When instructors gave considerable supervision to the marking of
the answer card, loss was reduced to less than 1 mark in 5000. Perhaps
the most important point in favor of computer scoring is that
scoring errors not only are reduced, but also are much more easily 
discovered than they are in electrographic scoring of the conven- 
tional answer sheet. 

One of the first “extras” uncovered concerns the scoring of two 
or more forms of a test in which identical items have been placed in 
different sequence. Offset reproducing after mark sense punching 
permits all forms to be scored with one key, a feature thus produc- 
ing a single choice analysis. Forms can be differentiated by color of 
answer card, by an odd-even student serial number, or by marking 
"A" or "B" forms as an answer. 


Summary 


A program has been written at San Jose State College which will 
score teacher-made tests on the IBM 1620. Students record answers 
with an electrographic pencil on a special mark sense answer card. 
Fifty-four items may be answered on a single card; up to nine cards 
per student may be used. While only one answer may be marked in 
one column of the answer card, changes in test format can overcome 
the limitation of one correct answer per item. 

The program scores the test and gives raw score, number right,
number wrong, and sum of rights plus wrongs for each student; it
computes the mean, standard deviation, Kuder-Richardson formula
21 reliability, and standard error of measurement for the set of
scores; it provides a count of the number of tests scored, the sum of
scores, and the sum of squares; and it makes a choice analysis and
provides both frequency and percentages in the summary.

At this writing the scoring service has just begun to receive its
operational test. Thus far no fault has been found with the computer
program; the only difficulties encountered have been in the mark
sensing operation. Faculty members who have used this system
seem quite pleased with the service. This system stands to be a
tremendous boon to instructors who really wish to analyze the performance
of their tests.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 3, 1963


A COMPARISON OF TWO COMPUTER-BASED 
PROCEDURES OF ORTHOGONAL ANALYTIC 
ROTATION WITH A GRAPHICAL METHOD 

WHEN A GENERAL FACTOR IS PRESENT¹


MARY L. TENOPYR 
North American Aviation, Inc. 
AND 
WILLIAM B. MICHAEL 
University of California, Santa Barbara 


Problem 


The purpose of this investigation was to compare the results of
two computer-based methods of rotation of factors—the normal 
varimax and the quartimax procedure—not only with each other, 
but also with graphical rotations when the factors have been derived 
from correlation matrices of personnel ratings. Ratings usually pose 
problems in factor analysis because they customarily contain a gen- 
eral factor, the presence of which complicates the rotational process. 

In a recent study by Michael and Tenopyr (1963), varimax and 
graphical rotations of factors that resulted from analyses of two 
correlation matrices of personnel ratings associated with two dif- 
ferent sets of instructions to raters were compared. In neither in- 
stance did the varimax approach yield what could be judged as a
satisfactory approximation to simple structure. In fact, it appeared 
that the extent of the departure from simple structure might be 
related to the size of the general factor. Moreover, in the process of 


¹ Deep appreciation is expressed to J. S. Mathews, Director of Personnel,
Autonetics Division of North American Aviation, Inc., for his supporting the 
study and for his providing the assistance of his staff in data collection, and 
to Miss Suzanne Ritter for her statistical computations and assistance in data 
compilation. 




maximizing the amount of variance in the columns of the factor
matrix, the varimax solution appeared to distribute much of the
variance of the general factor over other factors. It was thought
that the quartimax procedure, which tends to maximize variance
over rows instead of columns of the factor matrix, should be investigated
as an approach to the analysis of correlational data
suspected of containing a substantial general factor.
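
The column-versus-row distinction can be made concrete with the usual textbook forms of the two criteria: the raw varimax criterion sums, over factors, the variance of the squared loadings within each column, and the quartimax criterion sums the fourth powers of all loadings. The Python sketch below uses those forms on small hypothetical two-factor matrices. It is not the rotation program used in the study, and it omits the row normalization by communalities that the normal varimax procedure applies. All names and loadings are illustrative assumptions.

```python
import math


def quartimax_criterion(loadings):
    """Sum of fourth powers of all loadings."""
    return sum(a ** 4 for row in loadings for a in row)


def raw_varimax_criterion(loadings):
    """Sum over factors of the variance of squared loadings in each column."""
    p = len(loadings)
    total = 0.0
    for j in range(len(loadings[0])):
        squares = [row[j] ** 2 for row in loadings]
        mean = sum(squares) / p
        total += sum((s - mean) ** 2 for s in squares) / p
    return total


def rotate(loadings, degrees):
    """Orthogonal rotation of a two-factor loading matrix."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [[a * c + b * s, -a * s + b * c] for a, b in loadings]


general = [[0.8, 0.0]] * 4                                      # one general factor
clustered = [[0.8, 0.0], [0.8, 0.0], [0.0, 0.8], [0.0, 0.8]]    # two group factors

print(quartimax_criterion(general), quartimax_criterion(rotate(general, 45)))
print(raw_varimax_criterion(clustered), raw_varimax_criterion(rotate(clustered, 45)))
```

In this small example the quartimax criterion is larger when all four variables load on a single general factor than when that factor is split by a 45-degree rotation, which is consistent with the expectation that quartimax tends to preserve a general factor.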


Subjects and Procedures 


Using a graphic rating scale consisting of 16 items concerning
performance and personal characteristics (which are enumerated in
abbreviated form in the tables), one group of 47 assistant foremen, a
second matched group of 60 assistant foremen, and a third group of
55 fellow employees (peers), respectively, rated one sample of 110
employees for promotability, a second matched sample of 110 employees
for counseling-and-guidance purposes, and the first sample
of 110 employees for promotability.

Factors that were obtained from a principal components analysis
of the correlation matrices of the normalized personnel ratings corresponding
to each of the three conditions of evaluation were rotated
by graphic means, by the varimax procedure, and by the quartimax
method.


Results 


For the supervisors’ promotion-determination ratings the original
factor matrix and the graphically-rotated matrix are presented in
Table 1, and the varimax and quartimax rotated solutions are given
in Table 2. Similarly in Tables 3 and 4, the corresponding four
matrices associated with ratings given for purposes of counseling
and guidance are furnished. In the instance of the ratings by fellow
employees, the respective matrices are presented in Tables 5 and 6.

Supervisors’ ratings for promotability. When ratings were given by
supervisors to determine promotability, the original factor matrix
contained one very large factor that accounted for 88.9 per cent of
the common-factor variance, and three extremely small factors.
Upon graphical rotation, the size of the general factor interpreted to
be general bias or halo effect (GB-HE) was reduced in such a manner
that it accounted for only 67.2 per cent of the common-factor
variance. Only two of the other factors (II and III), which were
described to be human relations skills (HRS) and productivity (P),
appeared to be meaningful. (See Table 1.)
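
The percentages of common-factor variance quoted in this section can be understood as each factor's sum of squared loadings divided by the total of all squared loadings (the sum of the communalities). The Python sketch below illustrates that arithmetic. The function name and the small loading matrix are hypothetical and are not taken from the tables.

```python
def percent_common_variance(loadings):
    """Per cent of total common-factor variance carried by each factor."""
    n_factors = len(loadings[0])
    column_ss = [sum(row[j] ** 2 for row in loadings) for j in range(n_factors)]
    total = sum(column_ss)          # equals the sum of the communalities
    return [100 * ss / total for ss in column_ss]


# Hypothetical loadings of three variables on two factors.
demo = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
print([round(p, 1) for p in percent_common_variance(demo)])
```

Because an orthogonal rotation leaves every communality unchanged, the total stays fixed and rotation can only move variance from one factor to another.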

For the same data, the varimax rotation yielded what were es- 
sentially three general factors. Each factor accounted for approxi- 
mately one-third of the common-factor variance. The fourth factor 
was a residual. For the varimax and the graphical rotations, the interpretation
of the factors was very similar. The first factor was
identified as general mental ability or general bias (GMA-GB); 
the second was identified as human relations skills (HRS) ; and the 
third was designated as productivity (P). (See Table 2.) 

The quartimax solution produced quite different factors. In ad- 
dition to a general bias (GB) factor, it yielded two bipolar factors 
and a residual factor. The first bipolar dimension was essentially a 
non-productivity (NP) factor with a substantial positive loading 
on the item concerning emotional control, and with substantial nega- 
tive loadings on productivity and energy shown in work. Reflected, 
it would be a productivity factor. The interpretation of the second 
bipolar factor was questionable. A noteworthy feature about the 
bipolar factors is that they were essentially very small. The general 
factor accounted for apparently as much of the common variance 
after rotation as it did before rotation. (See Table 2.)

Supervisors’ ratings for purpose of counseling and guidance. In the 
original factor matrix the percentage of variance associated with the 
first factor (77.5) was smaller than that corresponding to either one 
of the two other original factor matrices (88.9 and 83.7). 

The graphical rotations yielded four factors which were tentatively 
identified as general bias (GB), human relations skills (HRS), 
productivity (P), and job knowledge (JK). The general factor in 
this case accounted for only 44.4 per cent of the common-factor 
variance. (See Table 3.) 

The varimax rotations also resulted in four factors. These were 
named initiative (In), human relations skills (HRS), productivity 
(P), and job knowledge (JK). These factors generally were fairly 
similar to those for the graphical rotations. Simple structure was 
again not achieved; however, the extent of the departure from sim- 
ple structure was apparently less marked than that for the other 
two sets of ratings. (See Table 4.) 

The quartimax rotation resulted in two relatively firm factors
and one doublet. These factors were described as general bias (GB),
human relations skills (HRS), and productivity (P). The general
bias factor accounted for 76.9 per cent of the common-factor variance;
the rotation by the quartimax method again failed to reduce
the amount of variance associated with the general factor. The
quartimax procedure again yielded a large number of negative loadings.
(See Table 4.)


TABLE 1

ORIGINAL FACTOR MATRIX AND GRAPHICALLY ROTATED FACTOR MATRIX FOR SUPERVISORS' PROMOTION-DETERMINATION RATINGS


TABLE 2

RESULTS OF VARIMAX ROTATION AND OF QUARTIMAX ROTATION OF FACTORS FOR SUPERVISORS' PROMOTION-DETERMINATION RATINGS


TABLE 3

ORIGINAL FACTOR MATRIX AND GRAPHICALLY ROTATED FACTOR MATRIX FOR SUPERVISORS' COUNSELING-AND-GUIDANCE RATINGS


TABLE 4

RESULTS OF VARIMAX ROTATION AND OF QUARTIMAX ROTATION OF FACTORS FOR SUPERVISORS' COUNSELING-AND-GUIDANCE RATINGS


[Tables 1-4: numerical entries not legible in this copy. Rows are the 16 item variables (1. Self-reliance; 2. Job knowledge; 3. Cooperation with supervision; 4. Cooperation with co-workers; 5. Grasping instructions; 6. Productivity; 7. Solving problems; 8. Emotional control; 9. Planning; 10. Developing new ideas; 11. Organizing; 12. Seeking more responsibility; 13. Quality of work; 14. Judgment; 15. Energy; 16. Promotability), followed by the common-factor variance and the percentage of total common-factor variance for each factor.]

In this table and the others, decimal points are omitted from all factor loadings (as well as from communalities, h²).
The meaning of letters in parentheses beneath the numerical designation of each factor is described in the text.

Fellow employees’ ratings for promotability. The graphical rotations
for the original factors yielded only five interpretable dimensions.
These were named as follows: general bias (GB), human relations
skills (HRS), initiative (In), general mental ability (GMA),
and motivation (Mo). The general factor accounted for 52.3 per
cent of the common variance. (See Table 5.)

Rotation by the varimax procedure yielded six factors, one of
these a doublet. Again simple structure was not achieved. Although
there was more variance in the size of factors than with supervisors’
promotion-determination ratings, there was no factor so much larger
than the others that it could be considered the only general factor.
The factors were named as follows: quality of work (QW), human
relations skills (HRS), initiative (In), general mental ability
(GMA), productivity (P), and organizational ability (OrA). (See
Table 6.)

Only two firm factors resulted from the quartimax rotation of factors
from fellow employees’ ratings. These were a general bias (GB)
factor and a human relations skills (HRS) factor. All other factors
were residuals with only one loading above .30 on each factor. Again,
no variance was removed from the general factor as a result of the
rotational process, and the general factor still accounted for 83.7
per cent of the common factor variance. (See Table 6.)
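The percentages of common-factor variance quoted above follow from the sums of squared loadings in each column of a factor matrix. The sketch below illustrates that arithmetic, together with the quantities that the quartimax and (raw, unnormalized) varimax procedures maximize. The loading matrix is hypothetical and is not taken from the tables of this study; the function names are likewise illustrative.

```python
# Hypothetical loadings (rows = rated traits, columns = factors).
# These numbers are invented for illustration; they are not the article's data.
loadings = [
    [0.80, 0.10, 0.05],
    [0.75, 0.30, 0.00],
    [0.70, -0.05, 0.40],
    [0.65, 0.35, 0.10],
]


def common_variance_shares(matrix):
    """Return each factor's percentage share of the common-factor variance."""
    n_factors = len(matrix[0])
    column_sums = [sum(row[j] ** 2 for row in matrix) for j in range(n_factors)]
    total = sum(column_sums)
    return [100.0 * s / total for s in column_sums]


def quartimax_criterion(matrix):
    """Sum of fourth powers of all loadings (the quantity quartimax maximizes)."""
    return sum(a ** 4 for row in matrix for a in row)


def varimax_criterion(matrix):
    """Raw varimax criterion: for each factor, the variance of its squared
    loadings across variables, summed over factors."""
    p = len(matrix)
    total = 0.0
    for j in range(len(matrix[0])):
        squares = [row[j] ** 2 for row in matrix]
        mean_sq = sum(squares) / p
        total += sum(s ** 2 for s in squares) / p - mean_sq ** 2
    return total


shares = common_variance_shares(loadings)
print([round(s, 1) for s in shares])
print(round(quartimax_criterion(loadings), 4), round(varimax_criterion(loadings), 4))
```

A first column that carries most of the squared loadings, as in this invented matrix, is what the text refers to as a large general factor.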


Conclusion 


It appears that, when applied to data having a large general factor,
the varimax method removes too much variance from
the general factor and redistributes the variance in such a manner
as to preclude simple structure. On the other hand, the quartimax
procedure appears to yield fewer factors and to remove too little
variance from the general factor. Except for some clarification of
minor factors, it contributes little to psychological meaningfulness.
Although graphical rotations are extremely time-consuming and
expensive, they appear to yield the most meaningful results when
correlational data are likely to contain a general factor.


TABLE 5
Original Factor Matrix and Graphically Rotated Factor Matrix for Fellow Employees' Promotion-Determination Ratings

[Table body not legible in the source. Rows are the 16 item variables: 1. Self-reliance; 2. Job knowledge; 3. Cooperation with supervision; 4. Cooperation with co-workers; 5. Grasping instructions; 6. Productivity; 7. Solving problems; 8. Emotional control; 9. Planning; 10. Developing new ideas; 11. Organizing; 12. Seeking more responsibility; 13. Quality of work; 14. Judgment; 15. Energy; 16. Promotability; followed by rows for the common factor variance and the percentage of total.]


TABLE 6
Results of Varimax Rotation and of Quartimax Rotation of Factors for Fellow Employees' Promotion-Determination Ratings

[Table body not legible in the source. Rows are the same 16 item variables, followed by rows for the common factor variance and the percentage of total.]

Note. The meanings of letters in parentheses beneath the numerical designation of each factor are described in the text.

REFERENCE


Michael, William B. and Tenopyr, Mary L. “Comparability of the
Factored Dimensions of Personnel Ratings Obtained Under Two
Sets of Instruction.” Personnel Psychology, in press.




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
Vol. XXIII, No. 3, 1963


THE WHERRY-DOOLITTLE TEST SELECTION 
METHOD IN FORTRAN 


NATHAN JASPEN 
New York University 


The Wherry-Doolittle Test Selection Method (Stead, Shartle,
et al., 1940, Appendix 5) is a technique for selecting a battery of
tests that will give the maximum shrunken multiple correlation with
a criterion. The shrunken multiple takes into account the chance
error added by each test. The tests are selected in the order of their
contributions to the multiple.

Wherry explains that “the increase in the multiple becomes less 
and less, while at the same time the chance error increases. Finally 
the point is reached where the addition of another test adds more 
chance error than actual validity to the battery. Application of the 
Wherry shrinkage formula after the addition of each test will show 
when this point has been reached and no further test additions are 
feasible” (Stead, Shartle, et al., 1940, p. 245). 

The shrunken multiple, $\bar{R}$, can be calculated from the formula

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - m},$$

where R is the multiple, n is the size of the sample, and m is the
number of variables.

Unfortunately, there is some ambiguity regarding the meaning of 
m. Wherry (1931, p. 441) defines m as "the number of independent 
variables” and as “the number of tests selected for the battery" 
(Stead, Shartle, et al., 1940, p. 246). These definitions are consistent 
with the worked-out example in Stead and Shartle (1940) and also 
in Garrett (1953, Chapter 18). Ezekiel, however, defines m in the
shrinkage formula as the “number of constants in the regression
equation, including a and the b’s” (Ezekiel, 1930, pp. 174-177). For
a multiple $\bar{R}$ involving a criterion and two tests, m by the former
definitions would be 2, and by the latter definition, 3.
As a test selection method, the Wherry procedure modifies the
Doolittle technique in two respects: it determines the order in which
the tests are selected, and it provides a stopping point. It often
happens, however, that the researcher does not wish to throw his
entire correlation matrix into the Wherry-Doolittle procedure, but
only some portion of the matrix. Furthermore, the researcher may
wish to evaluate a large number of combinations of two, three, four
or more tests. Many of these combinations may be nearly equal in
effectiveness, and considerations other than the maximum R are
frequently employed in making the ultimate selection. It is also
possible that if the tests are selected in some order other than that of
maximum contribution at each stage, a higher multiple may
ultimately be achieved.
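The two features just described, an order of selection and a stopping point, can be sketched as follows. This is not the Fortran program of this note, nor the Doolittle worksheet itself: it is a minimal illustration that solves the normal equations directly, adds at each stage the test that raises R² most, and stops when the shrunken R² (assumed here to be 1 - (1 - R²)(n - 1)/(n - m)) no longer rises. All names and the small correlation matrix are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * size
    for i in reversed(range(size)):
        known = sum(aug[i][c] * x[c] for c in range(i + 1, size))
        x[i] = (aug[i][size] - known) / aug[i][i]
    return x


def multiple_r_squared(corr, predictors, criterion):
    """R^2 of the criterion on the given predictors, from a correlation matrix."""
    rxx = [[corr[i][j] for j in predictors] for i in predictors]
    rxy = [corr[i][criterion] for i in predictors]
    betas = solve(rxx, rxy)
    return sum(beta * r for beta, r in zip(betas, rxy))


def select_tests(corr, candidates, criterion, n, count_criterion=False):
    """Add the test giving the largest R^2 at each stage and stop when the
    shrunken R^2 stops rising. count_criterion switches between the two
    readings of m (number of tests, or number of tests plus one)."""
    chosen, best_shrunken = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        r_squared = {t: multiple_r_squared(corr, chosen + [t], criterion)
                     for t in remaining}
        best = max(r_squared, key=r_squared.get)
        m = len(chosen) + 1 + (1 if count_criterion else 0)
        if n - m <= 0:
            break
        shrunken = 1.0 - (1.0 - r_squared[best]) * (n - 1) / (n - m)
        if shrunken <= best_shrunken:
            break
        chosen.append(best)
        remaining.remove(best)
        best_shrunken = shrunken
    return chosen, best_shrunken


# Hypothetical matrix: variables 0-2 are tests, variable 3 is the criterion.
corr = [
    [1.00, 0.30, 0.20, 0.50],
    [0.30, 1.00, 0.25, 0.40],
    [0.20, 0.25, 1.00, 0.05],
    [0.50, 0.40, 0.05, 1.00],
]
print(select_tests(corr, [0, 1, 2], criterion=3, n=40))
print(select_tests(corr, [0, 1, 2], criterion=3, n=40, count_criterion=True))
```

Because the rule is greedy, it illustrates the caution in the text: a battery chosen in some other order may in principle reach a higher multiple.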
In order to facilitate the calculation of a large number of shrunken
multiple R's, each based on a submatrix of the total matrix, a computer
program was prepared. The program was written in Fortran
so that it could be adapted to a wide variety of computers; and in
fact two versions, which are slight modifications of each other, now
exist for the IBM 7090 and the IBM 1620.
The input to this program is in the following order: 


1. A lead card, in which is indicated the size (J) of the correlation
matrix to be read in. The matrix includes the criterion variable.
2. The means and standard deviations of the variables included in
the correlation matrix.
3. The correlation matrix.
4. Any number of specification cards. Each specification card
indicates:
a. The number of individuals in the sample (N).
b. The number of variables in the total matrix, including the
criterion (J).
c. The number of variables, including the criterion, found in
the submatrix (K).
d. The identification of the K variables, the criterion being
listed last. For example, if J is 12, and if K is 7, the K variables
might be: 1, 3, 4, 12, 8, 6, 5. In this example, variable 5 is the
criterion.
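The role of a specification card can be illustrated with a short sketch that pulls the K-by-K submatrix out of the full J-by-J matrix, the criterion being the last variable listed. The function name, the validation messages, and the synthetic matrix are assumptions made for the illustration; they do not reproduce the card layout or the code of the Fortran program.

```python
def submatrix_from_spec(corr, variable_ids):
    """Return (submatrix, predictor ids, criterion id) for one specification.

    variable_ids are 1-based identifiers, with the criterion listed last.
    """
    j = len(corr)
    if not 2 <= len(variable_ids) <= j:
        raise ValueError("K must lie between 2 and J")
    if any(not 1 <= v <= j for v in variable_ids):
        raise ValueError("variable identifiers must lie between 1 and J")
    idx = [v - 1 for v in variable_ids]  # convert to 0-based positions
    sub = [[corr[a][b] for b in idx] for a in idx]
    return sub, variable_ids[:-1], variable_ids[-1]


# Synthetic 12-variable matrix (J = 12); the off-diagonal values are
# placeholders chosen only so that each cell is identifiable.
J = 12
corr = [[1.0 if a == b else (a + 1) * (b + 1) / 1000 for b in range(J)]
        for a in range(J)]

# The example from the text: K = 7, variable 5 is the criterion.
sub, predictors, criterion = submatrix_from_spec(corr, [1, 3, 4, 12, 8, 6, 5])
print(predictors, criterion, len(sub))
```

Since any variable may be listed last, the same full matrix serves for batteries predicting different criteria.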




The range of K is from 2 to J. Any of the variables may be con- 
sidered to be the criterion. Any of the variables may be used as 
predictors. The variables should be independent. A part score, for 
instance, is not independent of a total score which includes the same 
set of items. The Wherry-Doolittle Test Selection Method is ap- 
plied to the variables listed in the specification card. Application of 
the shrinkage formula may cause the selection process to terminate 
before all the tests specified have been included. Each specification 
card creates a page of output, including the following: 


1. A copy of the specification card. 

2. The means and standard deviations of the variables included 
in the submatrix. 

3. The submatrix. 

4. Two Wherry-Doolittles. In the first of these, m is taken to 
mean the number of predictors. In the second, m is taken to mean 
the total number of variables including the criterion. Each analysis 
includes for each variable its contribution to the multiple, $R$,
$R^2$, $\bar{R}^2$, $\bar{R}$, the Beta weight, the b weight, and the b/R weight. The a
weight is also calculated. The regression equation in terms of b's and
a provides “best estimates” of the criterion; the regression equation 
based on b/R weights provides scores that correlate 1.00 with the 
"best estimates" and which have a mean and standard deviation 
identical with the criterion. Except in the rare instance when the
two analyses select a different number of tests, they are identical
except in their evaluation of $\bar{R}^2$ and $\bar{R}$.
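The relation among the Beta, b, a, and b/R weights can be checked with a small numerical sketch. It assumes the standard conversions b = Beta × (criterion SD / predictor SD) and a = criterion mean minus the weighted predictor means; the constant used with the b/R weights is likewise an assumption, chosen so that the composite mean equals the criterion mean. The two-test figures are hypothetical.

```python
import math

# Hypothetical two-test battery (illustrative numbers only).
r12, r1y, r2y = 0.30, 0.50, 0.40        # intercorrelation and validities
means, sds = [50.0, 20.0], [10.0, 4.0]  # predictor means and SDs
mean_y, sd_y = 100.0, 15.0              # criterion mean and SD

# Beta weights for two predictors (closed-form solution of the normal equations).
beta1 = (r1y - r2y * r12) / (1 - r12 ** 2)
beta2 = (r2y - r1y * r12) / (1 - r12 ** 2)
betas = [beta1, beta2]
multiple_r = math.sqrt(beta1 * r1y + beta2 * r2y)

# Raw-score b weights and the constant a.
b = [beta * sd_y / sd for beta, sd in zip(betas, sds)]
a = mean_y - sum(bi * mi for bi, mi in zip(b, means))

# b/R weights, with a constant chosen so the composite mean equals the criterion mean.
b_over_r = [bi / multiple_r for bi in b]
a_over_r = mean_y - sum(ci * mi for ci, mi in zip(b_over_r, means))


def composite_sd(weights):
    """SD of a weighted sum of the two predictors, from their SDs and intercorrelation."""
    w1, w2 = weights[0] * sds[0], weights[1] * sds[1]
    return math.sqrt(w1 ** 2 + w2 ** 2 + 2 * w1 * w2 * r12)


# The "best estimates" have SD equal to R times the criterion SD;
# the b/R composite has the criterion SD itself.
print(round(composite_sd(b), 3), round(composite_sd(b_over_r), 3))
```

Since the b/R weights are the b weights divided by a constant, the two composites are linearly related and therefore correlate 1.00, as the text states.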

The last specification card may be followed by a new lead card for 
a different correlation matrix. The number of such inputs is un- 
limited. The number of specification cards for each input is un- 
limited. The size of the matrix that may be read in is a function of 
the computer that is used. 


REFERENCES 


Ezekiel, M. Methods of Correlation Analysis (First Edition). New
York: John Wiley & Sons, 1930.

Garrett, H. E. Statistics in Psychology and Education (Fourth Edition).
New York: Longmans, Green, 1953.

Stead, W. H., Shartle, C. L., et al. Occupational Counseling Techniques.
New York: American Book Company, 1940.

Wherry, R. J. “A New Formula for Predicting the Shrinkage of the
Coefficient of Multiple Correlation.” Annals of Mathematical
Statistics, II (1931), 440-451.




BOOK REVIEWS 


Edited by 
WILLIAM B. MICHAEL 


University of California, Santa Barbara 


Collier and Elam’s Research Design and Analysis: Second
Annual Phi Delta Kappa Symposium on Educational Research.
SAMUEL T. MAYO


Glaser's Training Research and Education. HANS GEORG STERN




Research Design and Analysis: Second Annual Phi Delta Kappa 
Symposium on Educational Research by Raymond O. Collier, 
Jr. and Stanley M. Elam (Editors). Bloomington, Ind.: Phi 
Delta Kappa, 1961. Pp. viii + 208. $3.50. 

The present volume is the second in a continuing series of reports 
on annual symposia, each emphasizing a distinct aspect of educa- 
tional research. The symposia are in keeping with the first of the 
fraternity’s three goals of research, service, and leadership. 

The seven authors of major papers and nine discussants represent 
such diverse disciplines as education, mathematics, agriculture, so- 
ciology, psychology, biostatistics, and economics. 

Suppose a reader approaches this particular book and asks the 
question, “What can I learn from this book that either I do not 
already know or could not learn from other books on research?” 
Examination of the Table of Contents would reveal several some- 
what specialized topics which one would not readily find elsewhere. 
Among these topics are large scale experimentation (including co- 
operative research), psychological aspects of research design, some 
problems one faces in programs of research, a contrast between 
status studies and controlled experimentation in the sense of manip- 
ulation of variables, the rationale for choosing one research design 
over another, as well as useful information about securing cooperation
from colleagues in an actual experimental program, deciding on
the size of sample, setting up procedures in educational experimentation,
and dealing with the problem of values in behavioral research.

After brief remarks by Professor E. F. Lindquist, Chairman of 
the Symposium, come a pair of related chapters on large-scale
cooperative experimentation. The first chapter, which is a summary of
the contents of the partially completed manuscript of the late
Palmer O. Johnson, is authored by Raymond O. Collier, Jr. This
necessarily brief paper expresses first some points of view characteristic
of the man Johnson and familiar to Collier, the present reviewer,
and others fortunate enough to have had Dr. Johnson as a mentor.
Secondly, Johnson's appeal for large-scale cooperative experimenta- 
tion is outlined. Some advantages of, but also some problems in, 
such research are listed. Two actual examples of large-scale research 
are described briefly, one for the Snyder, Texas, school district, and 
the other a statewide comparative study in Minnesota concerned
with the improvement of the teaching of mathematics. The second
chapter, by Paul C. Rosenbloom, describes this statewide study of 
the mathematics curriculum. 

The usual research topics, such as design of sampling surveys and 
controlled experiments, receive distinctive treatment. 

Leslie Kish’s chapter, “The Design of Sample Surveys,” is a 
concise outline of classic sampling survey theory, rationale, and 
procedures. It is the kind of chapter one would expect from a pro- 
fessional mathematical statistician who knows how to communicate. 

The following chapter, "The Sampling Problem in Educational 
Research,” by Francis G. Cornell, makes a strong case for the wide- 
spread need for sampling in educational research, while pointing up 
various barriers to the use of more sampling. Such barriers on the 
part of the public, educational administrators, teachers, and even 
some researchers, are constituted by the misconceptions they hold 
about either the necessity for sampling or the means of sampling. 
A number of actual sampling situations, both good and poor, are
cited.

Oscar Kempthorne in his chapter, “The Design and Analysis of 
Experiments with Some Reference to Educational Research," per- 
forms two valuable services. On the one hand, he lucidly describes 
the steps that an ideal educational researcher actually goes through, 
such as sampling from an experimentally accessible population in 
order to make inferences about a target population. On the other 
hand, he points up some of the difficulties one may encounter in 
experimentation. One such difficulty involves confusing the analyti- 
cal survey and the designed experiment. In one example given, it was 
surmised that it might be observed that “Harvard graduates make 
higher incomes ten years after graduation than graduates at the 
University of Minnesota. The immediate reaction of the ordinary 
parent is to make every attempt to get his son into Harvard.” Two 
other difficulties discussed by Kempthorne are the problem of non- 
additivity and the problem of reproducibility of treatments. 

In a chapter entitled “Psychological Aspects of Research Design,” 
S. B. Sells focuses on the cooperation of psychologist and educational 
psychologist in “attacking many of the important educational prob- 
lems of our time.” The three major topics are the influence of value 
systems upon research, the role of situational variables, and the use 
of a system model for incorporating the interaction frame of
reference in behavioral experimental design.

In the last chapter, Julian C. Stanley makes a concerted plea for
more rapprochement between status studiers and variable manipulators.
That rapprochement was needed had been amply demonstrated
by Lee J. Cronbach in his APA presidential address in 1957. Stanley
offers further evidence of gaps but goes on to suggest procedures to
close such gaps. A number of actual studies are cited which seem at
first glance to have been amenable to a purely psychometric approach
but which could have yielded much improved conclusions
had manipulation of variables been introduced. Among such examples
were the Terman “genius” study and Project Talent. Three
examples are given of studies whose designs were improved by
introducing manipulation. In one, an analysis of variance of essay 
test grades was run with reader, form, sequence, and order as main 
effects. In another, order of items in a rating scale and gender con- 
stituted the two bases of classification across individuals, the first 
being manipulated and the second organismic. Finally, there was an 
example of “psychometric experimentation” in the attempt to dif- 
ferentiate male delinquents from male nondelinquents by means of 
a factorially-structured picture projective test of attitudes toward 
authority. 

The present work is not a substitute for a text in research methods, 
but it could well serve as a supplementary text. It presents an in- 
terdisciplinary picture of educational research which is missing in 
the parochial flavor of extant research texts. It should be a must as 
a reference in research courses or in thesis and dissertation work. 
The reviewer has been able to use it and the other volumes in the 
series to good advantage as references in a graduate course in re- 
search methods. Obviously, it is also a must for the professional 
educator who is the least bit serious about the application of scientific
method to educational problems. Such research-minded persons
might well consider employing this second, as well as the first
and third volumes in the series, as wedges to “educate” recalcitrant 
school administrators, school board members, and politicians about 
the nature of good educational research. 

SAMUEL T. MAYO
Loyola University (Chicago) 


Theory and Practice of Psychological Testing (Third edition) by 
Frank S. Freeman. New York: Holt, Rinehart and Winston, Inc., 
1962. Pp. xviii + 697. $7.95.

This text is short on theory and long on clinical practice. The im- 
balance is more obvious now than when the first edition appeared in 
1950—not because the contents have changed markedly but because 
the psychometric theory developed since this date has not been in- 
corporated into the revised editions. 

The present revision is not a radical one. The text has been ex- 
panded in terms of both the number of pages and the page size. Two 
new chapters have been added at the beginning of the book, the first 
titled “Historical Background” and the second “Elementary Statistical
Concepts.” The contribution of these additions to the effectiveness
of the book as a text is not great. Although the historical review
is competent, it is not an effective “grabber” for students
beginning a course in measurement. The statistics chapter, which was
deleted from the second edition of the book, is back in the third.
However, the chapter is so condensed that its chief value is as a
review reference for students who sold their statistics text after
completing the descriptive course.

The discussion of validity and reliability has been expanded. The 
Technical Recommendations validity categories are treated briefly. 
The discussion of construct validity, which is extremely weak, will
mislead the student. Freeman implies that construct validity can be 
established simply by rational analysis of "the degree to which the 
test items individually and collectively sample the range or class of 
activities or traits, as defined by the mental process or the per- 
sonality trait being tested." Since Freeman does not make consistent 
application of the validity terminology in his discussion of specific 
instruments in later chapters, much of the impact of the expanded 
validity treatments is lost. 

More than three-fourths of the book is devoted to a discussion of 
various testing instruments in terms of their content and clinical 
usage. The analysis of each test is descriptive rather than critical. 
Although evaluations are included, these apply to all of the tests in 
a given category such as “Group Scales” or “Aptitude Tests.” This 
commits the discussion to a general level in which the sting that 
might be imposed by the criticism of any single test is lost com- 
pletely. 

Although a large number of tests are discussed, many of the more 
recently constructed instruments are not included. For example, no 
reference is made to the Cattell 16 PF Test, his Analytic-Objective 
Test Battery, Ammons’ Full-Range Picture Vocabulary Test, or 
the Guilford Creativity Test Battery, to name only a few. More- 
over, the failure to cite the publishers of each test is a limitation to 
students desiring to use the book as a reference in later years. 

The book is well edited and better produced than were earlier 
editions. (Few will recognize that the reviewer’s name is misspelled 
in the index.) 

As far as text usage is concerned, the book should appeal more to
instructors looking for an overview of available psychological tests
than to those seeking a rigorous undergirding in psychometric
principles.

RICHARD E. SCHUTZ

Department of Educational Psychology 

Arizona State University 


Scientists: Their Psychological World by Bernice Eiduson. New 
York: Basic Books, Inc., 1962. Pp. xiv + 299. $6.50.

Bernice Eiduson reports on a significant research project aimed at
investigating the psychological world of the scientist. The study
provides empirical data about the characteristics, abilities, and
motivations of scientists in the hope that these data will replace, or
at least be compared with, the prevalent stereotypes. In order to
provide the necessary information, forty male Caucasian research
scientists volunteered to submit to psychological study. The areas
investigated included their developmental histories, adult personality
structures, intellectual abilities, integrative capabilities, and
their relationships to the scientific community and to their families.
The subjects took the TAT and Rorschach tests and participated in 
open-ended depth interviews. The test data were interpreted by 
three experienced clinical psychologists who were under instructions 
to examine the two tests together and to rate the subjects on fifty 
variables dealing with such characteristics as thinking, perceiving, 
personality structure, and motivation. The interviews were all done 
by Dr. Eiduson. 

The results of the analysis provide some interesting insights. One 
finds, for example, that most of these scientists had little contact 
with their fathers, and many of them set out on their own either dur- 
ing adolescence or when they started college. Their personalities 
varied greatly, but all of them had a large investment in intellectual 
pursuits and demonstrated independence in their behavior patterns. 

In the chapter entitled “The Self-Images of Scientists,” the selec- 
tions from the interviews provide the reader with an insight into 
the scientist’s search for a professional identity. The author cites 
evidence to show that “what scientists actually do is a far cry from
what they think they should be doing (p. 190).” While they may
feel ambivalent in their role as scientists, they do feel devotion
toward, and an enjoyment in, their work. They are not bored and
their leisure time activities are usually active. Play, for the scientist,
is not an escape from work but a way of using new skills and of
refreshing himself for his main role.

The report is of a clinical study in which the lives of forty re- 
search scientists are examined psychologically in order to determine 
the forces which led them into science. The topic is important, and 
the study was both well conceived and carefully conducted. The re- 
sult is an interesting and informative book. 

HAROLD BORKO
System Development Corporation 


Management and the Computer of the Future by M. Greenberger 
(Editor). New York: Massachusetts Institute of Technology 
Press and John Wiley & Sons, Inc. Pp. xxvi + 340. $6.00.

In the spring of 1961, the Massachusetts Institute of Technology 
celebrated its centennial year. In honor of this occasion, the School 
of Industrial Management of M.I.T. sponsored a series of eight
evening lectures on the theme, “Management and the Computer of
the Future.” These lectures, together with the prepared remarks of
the discussants and other comments, constitute the book.

The lecture series (and therefore the book itself) is a rather re- 
markable achievement when judged by the reputation of the speak- 
ers as well as by the wide range of subject matter. Rarely does one 
find in an edited volume such consistently high quality of material 
or such a distinguished roster of contributors. The eight chapters 
(i.e., lectures) are as follows: (1) Scientists and Decision Making;
(2) Managerial Decision Making; (3) Simulation of Human Thinking;
(4) A Library for 2000 A.D.; (5) The Computer in the University;
(6) Time-Sharing Computer Systems; (7) A New Concept in
Programming; and (8) What Computers Should be Doing.

Of particular interest to psychologists is the chapter on “Simula- 
tion of Human Thinking” by Herbert A. Simon and Allen Newell. 
Professor Simon, while not a psychologist by formal training, has 
formed a close rapprochement with psychologists, and has been an
invited speaker at an APA convention. The co-author, Allen Newell,
who received his doctorate in industrial administration, is associated
with Dr. Simon at the Carnegie Institute of Technology where they
are both working on problem-solving programs and on the simulation
of human thought processes. The discussants for this topic were
Marvin C. Minsky, mathematician and co-director of the M.I.T.
Artificial Intelligence Group, and George A. Miller, a psychologist
and co-director of Harvard's Center for Cognitive Behavior.

A most interesting thesis, espoused by Professor Simon, and one 
which provides a great deal of food for thought, is the statement 
that “these [computer] programs can be regarded as theories, in a
completely literal sense, of the corresponding human processes.
These theories are testable in a number of ways; among them, by
comparing the symbolic behavior of a computer so programmed with
the symbolic behavior of a human subject when both are performing
the same problem-solving or thinking tasks” (p. 97). The theory
described by Professor Simon is the computer program called “The
General Problem Solver” or GPS. The General Problem Solver is
described as a “system of methods—believed to be those com- 
monly possessed by intelligent college students—that turn out to be 
helpful in many situations where a person confronts problems for 
which he does not possess special methods of attack” (p. 98).

If these claims seem somewhat exaggerated, it is of interest to
note that Dr. Minsky, a well-known expert in artificial intelligence,
is not overly impressed by the attempt to simulate the problem-solving
methods of intelligent college students, for he feels that
“within our lifetime machines will surpass us in general intelligence”
(p. 118). And how does Dr. Miller, a psychologist, react to all this?
He accepts the concept, finds himself “stuck for something worth
fighting about,” and concludes with the suggestion that “a psychologist
who wants to make his theory both comprehensive and
explicit is perhaps not driven, but is certainly nudged, in the direction
of the computer program as a natural way to do it” (p. 121).

Clearly, computers and computer technology are beginning to
play an increasingly important part in our lives, and especially in
the professional aspects of our lives as psychologists and educators.
Psychologists must learn more about these machines, their capabilities,
their potentialities, and their role in relationship to them. This
book contains a great deal of the desired information. Written by
experts, this carefully edited volume reads well. Greenberger, as
editor, wisely included biographical sketches of the participants, a
selected bibliography of readings arranged by topic, and an index
of the subjects covered in order that the interested reader could locate
the remarks of different speakers on the same topic. The book is a
fitting tribute to M.I.T.'s centennial year and a stimulating
introduction to the “Computer of the Future.”

HAROLD BORKO

System Development Corporation 


Training Research and Education by Robert Glaser (Editor). Pitts- 
burgh, Pennsylvania: University of Pittsburgh Press, 1962. Pp. 
vii + 596. $11.00. 

Training Research and Education is a welcome addition to a growing
list of vital books in education. To describe it first in general
terms, the book is a collection of chapters by distinguished authors
who contribute their latest insights and report on their own research
as well as that of others. An effort is made, according to editor
Glaser, to bridge the gap between the theoretical science of learning 
and education. Too often, complains Glaser in the opening chapter, 
this gap is left unbridged because educational psychologists and 
experimental psychologists grow up in different academic worlds. 
Although the findings presented in the book do not generally repre-
sent tremendous new discoveries or advances in the science of learn-
ing or behavior, it is precisely because the chapters often show how 
theory may be translated to practice that the book will be valuable 
to practitioners in the field of education. 

The book owes its organizational structure to a “systems” ap-
proach to education. The five components of Glaser’s “system” are:
(1) Instructional Goals—the System Objectives; (2) Entering Be- 
havior—the System Input; (3) Instructional Procedures—the Sys- 
tem Operator; (4) Performance Assessment—the Output Monitor; 
and (5) Research and Development Logistics. While it is quite pos- 
sible to quarrel with the “systems” approach in general, and with its 
application to education in particular, it must be admitted that,
as Glaser uses it in this book, it serves to bring order and meaning 
to chapters that cover a wide range of phenomena. 


The book has few flaws, and these are really minor. Some topics,
such as the one on Individual Differences and the Design of Instruc-
tion, might have benefited by a more extensive treatment. More-
over, the writing of the chapters is not entirely uniform. Occasional
chapters are more difficult to read than they should be because of
the author’s excessively pedestrian style of writing.

The group of contributors is a distinguished one indeed. Most of
them, including Glaser, Gagné, and Lumsdaine, are no strangers to
the student of instructional technology. It is perhaps a significant
fact that much of the research reported in the book was accom-
plished under government and military grants and contracts.

Teachers of research courses on the graduate level in psychology
or education will find this book a valuable text. It is also highly 
recommended as collateral reading in undergraduate or graduate 
courses in education. 

HANS GEORG STERN
Los Angeles City Schools 


Approaches to the Study of Administration in Student Personnel 
Work (Minnesota Studies in Student Personnel Work, No. 9) by 
Martin L. Snoke (Editor). Minneapolis: University of Minne-
sota Press, 1960. Pp. 71. 

This 71-page booklet contains five papers presented at the Center 
for Continuation Study at the University of Minnesota before the 
Institute for Student Personnel Administrators held in August, 1957. 

The major purpose of this institute, reflected by the papers, was
to combine three basic ideas, through lecture and discussion, which
might contribute to increased effectiveness of student personnel ad-
ministration.

First, it seemed that the case study method, as used in seminars at
the Harvard Graduate School of Business Administration, would be
an excellent means of teaching personnel workers how to consider
alternative decisions and their consequences as bases for arriving at
a rationale for administrative action. Robert E. Merry's paper, “The
Case Study Method,” exemplified this approach. In it, the objectives
and procedures for use of the case study method are outlined, to-
gether with a lengthy case illustrating a possibly typical set of is-
sues reflective of conflicts in college personnel work. The author
points up the distinction between a case and the case method, with
the latter a means of developing a skill or way of approach rather
than of acquiring precise knowledge. He goes on rather skillfully to
indoctrinate the reader into the merits of the case method approach
as an opportunity to learn, to facilitate discussion, and to sharpen
perceptions.

The second major idea incorporated in the institute program,
reflected in three of the five papers, was an attempt to provide op-
portunities for integration of new concepts and research findings in
human relations, sociology, psychology, and higher education into 
the total picture of administrative decision-making. The first two 
papers were presentations by Donald C. Pelz of the Institute for 
Social Research at the University of Michigan. “Examples of the 
Scientific Study of Organizational Behavior” is a brief review, with 
critical comments, of five different kinds of scientific studies of 
organizational functioning which included the classic study in the
1930’s made at the Hawthorne Plant of Western Electric and the 
laboratory experiment in democratic, laissez-faire, and autocratic 
leadership with groups of 12-year old boys by Lippitt and Lewin. 
Three more recent studies made by the Institute for Social Research
give further dimension to the paper. Insights from these studies were 
utilized in the case method discussions by the institute participants. 

In the second paper by Pelz, “Some Important Variables in Or- 
ganizational Functioning,” such factors as participation in decision- 
making, frequency of contact with colleagues, flexibility of organiza- 
tional structure, maintenance of individual motivation, and informal 
group norms are treated with some detail. In both papers, the author 
cautions the reader to exercise judicious restraint in making any 
sweeping generalizations from the findings of social science research
with regard to their application in university settings, particularly
in student personnel work.

In support of this second basic idea of integrating research into 
the framework of student personnel administration, a third paper 
presented by Ben Willerman, entitled “Some Results of a Research 
Program in the Social Psychology of Student Life," comes closer to 
the context of considerations by the members of the institute. This 
paper was based upon several years of research at the University of 
Minnesota, designed to provide a more systematic understanding of 
the social behavior of students as affected by their personal charac- 
teristics, their interpersonal relations, their group memberships, and 
their participation in various programs of the university. The most 
pertinent portion of this paper is the last section, in which the author 
cites some implications of research for personnel administration, in- 
dicating some tentative conclusions, but also further needs for in-
vestigation.

A forthright and provocative final paper, supporting the third 
basic idea of the institute, is focused upon the proposition that goals 
and purposes of a student personnel administration must be rea- 
sonable and consistent with those of other parts of the institution. 
This is a paper presented by T. R. McConnell on “The Relation of 
Institutional Goals and Organization to the Administration of Stu- 
dent Personnel Work,” actually appearing as the second paper in 
the booklet. Many stimulating quotes are possible from this paper, 
such as “An institution which takes the position that intellectual
education is its sole function, and which relegates moral, spiritual,
emotional, and social phases of development to an incidental level, 
or even disclaims any responsibility for them whatever, will not give 
any great scope to its student personnel program and organization.” 
A paper such as this could well form the starting point for any local 
student personnel services division which desires to reconstitute or 
revitalize its program. 

The combining of these three ideas, exemplified in the papers, 
actually took place in the group discussions about the cases used in 
the institute. It is regrettable that some of the issues which emerged 
from the process of learning at the institute could not have been 
included in the booklet by the editor. Nevertheless, the booklet has 
considerable value for student personnel administrators who may 
desire to use it, along with other resource materials, as a means for 
self-evaluation and further consideration of the effectiveness of their 
own organization. 

S. Marvin RIFE 
University of Rhode Island


Organization Theory in Industrial Practice by Mason Haire (Edi- 
tor). New York: John Wiley & Sons, Inc., 1962. Pp. 173. 

This small book is the second of two books edited by Mason Haire, 
growing out of symposia on theories of organization (and especially
modern industrial organization) sponsored by The Foundation for 
Human Behavior. 

The first symposium (which produced the first book, Modern
Organization Theory: New York, John Wiley & Sons, Inc., 1959)
had as contributors “a group of social scientists.” Their work was
discussed by a group of businessmen. At the second meeting,
representatives from industry presented papers related to the
organization theory-in-action. This book constitutes a collection of papers
from this meeting. The papers and hence the quality of the book are 
as varied as the authors and their backgrounds. 

The papers range from a simple description of a most complicated
organization structure to a most urbane and delightful exposition of
organizational theory complete with paramathematical models 
(Lloyd’s Origins and Objects of Organizations). Glenn Gilman’s 
thoughtful and scholarly essay on Authority, which is the longest 
presentation, is one of the more provocative papers for the be- 
havioral scientists’ contemplation. 

In these papers one will find an exposition of humanistic philoso-
phy; “hardnosed” and very down-to-earth descriptions of present
structure and functions of ongoing organizations, and quite detailed 
and illuminating descriptions of structure and function in changing 
groups. 

Some provocative (and contradictory) concepts of the place of
man in the organization are specifically presented. Estes and Jones
see not only man as the essential factor in the structuring of the 
organization but also the needs of the employee and of the executive 
as being crucial to the organization, whereas Scoutten’s exposition 
of the Maytag Company reads like a classical description of “sci- 
entific management” with some consideration for the worker but 
with the needs and goals of the organization being the prime param- 
eter. 

The fact that these men are “businessmen” has not led to mun- 
dane, pedestrian, or narrow papers. Lecompte’s description of be- 
havior in a changing organization in the communications’ industry 
and Kolb’s description of the changing structure of the staff man’s 
role are written with a thoughtfulness, sensitivity, and concern for 
theoretical implications that would look well in any graduate stu- 
dent's thesis. 

If the authors are to be considered as businessmen, they are a
singularly sophisticated and articulate group. Perhaps what the
reviewer is reacting to is Dr. Haire's considerable editorial skill.

This book is a must for the industrial consultant or for any serious
student of business organization. While some of the papers will be 
of more interest to the new student, there is much from which any 
sensitive business executive or social scientist concerned with man's 
interaction with his fellow man in an industrial matrix can profit. 

HAROLD R. MUSIKER
Rhode Island Hospital 


Stereotypy of Imagery and Belief as an Ego Defense by Rosemary 
Gordon. The British Journal of Psychology, Monograph Sup- 
plements, 1962. Pp. 96. 

This monograph is divided into three major sections: (a) A gen- 
eral introduction to the phenomenon of stereotypy, with particular 
emphasis on its psychological characteristics; (b) A description of 
five somewhat subjective studies intended to test several hypotheses 
based on well-grounded speculation and previous research; and (c) 
Remarks pertaining to the practical, social-psychological signifi- 
cance of stereotypy. The hypotheses, relevant to the spread of stereo- 
typy and to its relationship to personality structure, which are 
couched in an admixture of Freudian and Gestalt theories, were 
tested mainly by means of introspective reports and projective tech-
niques. The author offered the general conclusion that stereotypy is
an analgesic devised by the mind to protect itself against anxiety;
that is, pain.

A large part of the relatively long introduction (approximately 
one half of the monograph) is a thorough and rather literary analy- 
sis of the psychology of belief, to which Dr. Gordon ascribed a major 
role in relating cognitive mental process to behavior and action. In
places the metaphorical language appeared to detract from the
scientific intent of the report, e.g., “much like the new civilian re- 
cruits, the recent and novel impressions are forced to divest them- 
selves of their distinct and individual garments, step into the 
uniform provided and march in unison behind the leader—the ‘pre- 
conceived opinion’ or prejudice.” Ensconced in such florid verbiage 
is discussion of some worthy and thought-provoking ideas: (from A. 
Bain, 1872) “Doubt and fear run closely together—doubt in its pain- 
ful and distressing form is a state of fear"; and (from F. C. Bartlett, 
1932) “Uniformity and simplicity of structure of stimuli are no 
guarantee whatever of uniformity and simplicity of structure in 
organic response, particularly at the human level.” 

In sharp contrast to the ample coverage of introductory material 
is the perfunctory treatment of the experimental method. One fears 
that a large part of experimental ingenuity employed to test the 
hypotheses was never clearly revealed because of the paucity of 
detail characterizing this section. More specific criticisms are: 
evidence of questionable methodology (e.g., "The Ss had been given 
the test to complete at home—they had, however, been asked not to 
spend too much time over it.”); the most cursory discussion of sta-
tistical tests used in analyzing the data; a concise but not overly
convincing defense of the validity of introspective reports.

Another disturbing aspect of this monograph is the fact that many 
of the conclusions outdistanced the data upon which they were, 
hopefully, based. Far from being definitive, the results, in some 
instances, seemed merely suggestive of the stated relationships. 
Some of the more interesting points were: (a) Imagery, being a
process probably less subject to control by cognitive and volitional
forces than are concept and belief formation, is more likely to suc-
cumb to the influence of an encroaching stereotypy; (b) Emotionally
toned concepts and beliefs, it might be argued, tend on the whole to
become rigid and change-resisting, that is stereotyped; (c) Intelli-
gence may influence the development of stereotyped mental contents
only when it exists in a subnormal form.

Considered as a work of considerable scholarship and heuristic
value, this monograph is certainly worth reading. 

PATRICK J. CAPRETTA
Miami University (Ohio) 


Sociology of Punishment and Correction by Norman Johnston, Leon-
ard Savitz and Marvin E. Wolfgang. New York: John Wiley &
Sons, 1962. 

The questions which must be answered concerning the appearance 
of a collection of readings in a given area are three: (1) Is the col- 
lection an adequate and useful sampling of the field?, (2) Is the 
editorial preparation appropriate?, and (3) Has an attempt been
made to integrate the selections so as to provide a unified view of
the field in question? It is to the credit of the authors of this work 
that they have largely succeeded on all three counts. 

Certainly, one can gain a clear picture of all phases of the correc- 
tion process from (1) a discussion of the operation of the court 
system (Section I), (2) an excellent treatment on the social organiza-
tion and culture of the prison community (Section II), (3) an ex- 
ploration of techniques of penal treatment (Section III), and finally 
(4) a presentation of a short selection of articles dealing with at- 
tempts at crime prevention (Section V). 

Perhaps the most interesting section of the book to the statisti- 
cally inclined deals with parole prediction (Section IV). Unfortun- 
ately, the statistical structure of such methods is not made entirely 
clear by the Wilkins article, “An Essay in the General Theory of 
Prediction Methods.” One gains the impression that the techniques 
rely heavily on decision theory as well as upon other approaches of a 
more conventional sort. A clearer exposition would have been useful.

Considerable attention is given in Section IV to the work in de- 
linquency prediction carried out by the Gluecks. Basically, the 
Gluecks chose five factors found to discriminate most highly be- 
tween the delinquent and the nondelinquent portions of a contrived 
sample of matched pairs of delinquents and nondelinquents. Crit- 
icisms of the approach are given by Reiss (“Critique of the Glueck 
Social Prediction Table”). A rebuttal to these criticisms is given in 
turn by Sheldon Glueck. 

One of Reiss’ main contentions is that actuarial predictions derived 
from a contrived population are of doubtful utility in dealing with 
natural populations in which the ratio is much more in favor of the 
nondelinquent youngster. Glueck's viewpoint is that this makes lit- 
tle difference when one is attempting to determine the factors which 
distinguish between the two states irrespective of the population 
from which the two samples are originally drawn. 
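
Reiss's point about the population ratio can be made concrete with a small calculation. The sketch below is an illustration only: the sensitivity and specificity figures (0.90 each) and the 10 per cent natural base rate are hypothetical values chosen for the arithmetic, not figures from the Glueck tables, and the function name is likewise an assumption.

```python
def positive_predictive_value(sensitivity, specificity, base_rate):
    """Share of youngsters flagged as delinquency-prone who are in fact delinquent."""
    true_pos = sensitivity * base_rate
    false_pos = (1.0 - specificity) * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)

# Hypothetical accuracy figures, NOT taken from the Glueck Social Prediction Table.
sens, spec = 0.90, 0.90

# Matched-pairs (contrived) sample: delinquents and nondelinquents in equal numbers.
ppv_matched = positive_predictive_value(sens, spec, 0.50)

# Assumed natural population in which one youngster in ten becomes delinquent.
ppv_natural = positive_predictive_value(sens, spec, 0.10)

print(round(ppv_matched, 2), round(ppv_natural, 2))
```

Under these assumed figures, nine in ten of the youngsters flagged in the matched sample are delinquent, against one in two in the natural population, although the discriminating factors are unchanged. The two positions are therefore not strictly opposed: the factors may still discriminate, as Glueck holds, while the practical yield of a prediction depends on the population ratio, as Reiss holds.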

Additionally, Glueck presents an impressive array of empirical 
evidence, the “proof of the pudding” that the five factors he has 
identified are important in distinguishing differential proneness to 
delinquent behavior. The entire discussion of the Glueck investiga- 
tions is highly illuminating. It points out the many pitfalls involved 
in predicting complex patterns of human behavior on the basis of
limited statistical information.

The section on prediction wisely includes an article by Hayner,
“Parole Boards’ Attitudes towards Predictive Devices,” on the
administrative acceptance of statistical prediction. The article
indicates that, for a high proportion of those giving low acceptance
to parole prediction devices, the reluctance has nothing to do with
the technical efficiency of the devices. Those charged with the
responsibility for evaluating parole
risks are, according to Hayner, (1) reluctant to abide by a purely
statistical classification of an individual felon and (2) unwilling to
ignore possibly significant within-prison experience (a premise of 
many parole prediction devices). Hayner's article, which offers a
needed corrective to the viewpoint that statistical devices in the cor- 
rectional field are their own best advertisement, indicates that such 
techniques must be “sold” to interested personnel before they will 
be adequately utilized. 

ROMOLO TOIGO
Rip Van Winkle Foundation
Hudson, New York


Counseling and Guidance: An Exploration by Leslie E. Moser and 
Ruth Small Moser. Englewood Cliffs, N. J.: Prentice-Hall, Inc.,
1963. Pp. xii + 432. $6.95. 

The title of this book is quite appropriate, for its content certainly
presents to the reader an exploration into all the conceivable aspects
of guidance and counseling, and also into some of the fringe areas.
By intent the content is extensive, but not comprehensive; the style 
is elementary, but factual. There are more than 400 citations listed 
in the references. The authors, who document nearly all their factual 
or controversial statements, do not impart to themselves much 
authority. On the one hand, this practice is commendable; on the 
other hand, it does not reflect the authors’ wide experience in the 
field of guidance and counseling. 

It is difficult for this reviewer to determine who would be the best
audience for this book. It surely is a basic book in the field of 
guidance, and the uninitiated reader will receive a thorough in-
troduction to counseling theories, practices, policies, and philosophies
(past, present, and future). However, the necessarily superficial
coverage of the concepts and topics in almost every conceivable facet 
of counseling theory and practice makes it rather difficult for the 
serious student of guidance services to receive the proper foundation 
for the more advanced courses in his program. Since coverage of
certain basic areas is sacrificed for the sake of over-all extensive
coverage, however, this shortcoming should not detract
from the value of a text like this. Personally, this reviewer would
welcome this book as an informative supplement to the usual texts 
appropriate for introductory courses in guidance with their usual 
emphasis upon basic principles and practices—primarily in the area 
of school guidance services. It would not be suitable as a substitute
text in the basic course, as the authors suggest in the preface. Rather,
it could serve as the basis for an additional survey course designed
to orient the new student to guidance and possibly to stimulate his 
thinking toward the pursuit of counselor training. 

The extensive coverage by this text is evidenced in the table of 
contents. There is a discussion of individual pupil services, group
activities and processes, psychological foundations, student personnel
work in colleges and universities (including a brief statement on all 
the possible guidance activities in higher education), and counseling 
practices in military and non-military government agencies, in busi- 
ness and industry, in private and community clinics, and in religious 
settings. It is this latter emphasis, covering the last 110 pages of the 
text, which gives the book its unique character. 

The lack of intensity and comprehensiveness is evidenced by the 
fact that in most cases only a paragraph or two is devoted to a dis- 
cussion of rather important concepts, theories, and practices; in 
many instances, only one plan out of many possible ones is presented 
to the reader. For example, in Chapter I, “Foundations of Guidance,” 
in discussing the organization of guidance services, the authors pre- 
sent only the situation in which the guidance staff reports to and/or 
is supervised by the school principal. They omit any reference to 
situations in which the guidance personnel are supervised in their 
professional roles directly by the director of guidance services. 
Notably absent is mention of the research activities and responsi- 
bilities of counselors and other guidance workers as well as tech- 
niques for the evaluation of counseling outcomes. 

In a text which gives superficial coverage to a great variety of 
rather technical topics, and yet is intended for an unsophisticated 
reader, the authors need to be especially cautious not to mislead or 
misinform the naive student. In general, Moser and Moser do an 
admirable job in presenting the reader with a careful and objective 
overview of guidance principles and practices, which neither biases 
the reader’s understanding nor presupposes significant prerequisite 
knowledge. However, in certain instances the omission of pertinent 
facts by the authors on critical or controversial issues might serve 
to leave the reader with impressions which are not only erroneous, 
but possibly dangerous. For example, in discussing personality as- 
sessment techniques, the authors state “. . . the school counselor 
does not, as a rule, utilize personality tests that derive clinical 
categories of personality deviation.” At the end of the chapter in 
which this statement appears, a selected list of tests is presented as 
being applicable and recommended for use by school guidance 
counselors. Among these are the MMPI, TAT, and MAPS, with 
no mention whatsoever of the APA’s A.B.C. levels for test distribu-
tion and utilization.

Later on in attempting to explain differences between counseling 
and psychotherapy, the authors imply that the school guidance 
counselor may be expected to become involved in psychotherapeutic
counseling periodically simply because of the heterogeneous nature 
of the problem cases which are referred to him. There is no ac- 
companying explanation of the vast difference which exists in the 
nature, level, and scope of training between school counselors and
psychotherapists. There are already too many “amateur psychi-
atrists” among the ranks of school guidance counselors without
deluding the uninitiated student into believing that counselor training
is a pathway toward the somewhat more fascinating and, perhaps,
mysterious practices of psychotherapy. These are but a few ex-
amples in the text of how the inevitable failure to elaborate on cer-
tain important and basic issues might be more harmful in the long
run than helpful.

On the positive side, the book is well written and documented, and
its content is not only extensive, but very much up-to-date. The 
counselor's role is discussed in every conceivable setting; the train- 
ing and experience requirements for each setting are objectively and 
fully presented, and the psychological foundations for guidance 
practices are treated in an extensive and professional manner. In 
summary, this book provides a good means for introducing the stu- 
dent to some of the basic concepts in the guidance field, and to a 
knowledge of the nature, scope, and extent of counseling activities. 
It does not, however, fulfill the task of providing the student with 
much understanding of the myriads of facts and issues to which he 
is exposed. 
PETER F. MERENDA 
University of Rhode Island 




EDUCATIONAL AND
PSYCHOLOGICAL
MEASUREMENT


Editor: G. Frederic Kuder, Duke University
Business Manager: Geraldine R. Thomas


BOARD OF COOPERATING EDITORS


LOUIS D. COHEN
University of Florida

HAROLD A. EDGERTON
Performance Research, Incorporated

MAX D. ENGELHART
Chicago City Junior Colleges

E. B. GREENE
Chrysler Corporation

J. P. GUILFORD
University of Southern California

JOHN A. HORNADAY*
Houghton Mifflin Company

E. F. LINDQUIST
State University of Iowa

FREDERIC M. LORD
Educational Testing Service

ARDIE LUBIN
Walter Reed Army Institute of Research

SAMUEL MESSICK
Educational Testing Service

WILLIAM B. MICHAEL
University of California, Santa Barbara

M. W. RICHARDSON
Richardson, Bellows, Henry and Co.

JOHN H. ROHRER
New Mexico Highlands University

P. J. RULON
Harvard University

C. L. SHARTLE
Ohio State University

THELMA G. THURSTONE
University of North Carolina

HERBERT A. TOOPS
Ohio State University

E. G. WILLIAMSON
University of Minnesota

BEN D. WOOD
Columbia University

DOROTHY ADKINS WOOD
University of North Carolina


VOLUME TWENTY-THREE, NUMBER FOUR, WINTER, 1963


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
VOL. XXIII, No. 4, 1963


AN EVALUATION OF A TECHNIQUE FOR SCALING HIGH 
SCHOOL GRADES TO IMPROVE PREDICTION 
OF COLLEGE SUCCESS 


E. F. LINDQUIST 
State University of Iowa 


THE purposes of this article are: (1) to present a logical analysis
of the problem of how to scale or adjust high school grades to make
them better predictors of college grades and more comparable in
meaning from school to school, (2) to suggest what seems the most
promising practical method of scaling for centralized use with a
large number of schools and colleges simultaneously, (3) to report
the results of an empirical test of this method conducted by the
American College Testing Program on a large sample of schools and
colleges, and (4) to discuss some of the implications of these results,
both in general and with reference to a similar study conducted by
Benjamin S. Bloom and F. R. Peters (1961).

To introduce the logical analysis of the problem, we shall first 
describe what is presumably the theoretically best possible “in- 
ternal” method of scaling high school grades to improve the predic- 
tion of college grades. By an “internal” method is meant one that 
utilizes only the information contained in the (high school and col- 
lege) grades themselves, or does not depend on any other data (such 
as test scores) as a basis of adjustment. Let us denote the criterion 
—the college grade point average—by Y, and the predictor—the 
high school grade point average—by X. So far as any individual col- 
lege is concerned, then, the best procedure is to predict college grades 
from high school grades (or Ys from Xs), for each high school in- 
dividually, using in each case the regression equation established for 
the given high school alone. The “scaled” value (X*) of any given
high school grade is then nothing more than the estimated or ex-
pected Y corresponding to that X, as determined from the line of
regression of Y on X for that school. 
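
The school-by-school procedure just described can be sketched in a few lines. This is a minimal illustration only, assuming NumPy and simulated data; the function name `scale_grades`, the three hypothetical schools, and the "leniency" shifts are inventions for the example and are not part of the study.

```python
import numpy as np

def scale_grades(x, y, school):
    """Replace each high school grade X by X*, the Y predicted from the
    regression of Y on X fitted within that student's own high school."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    school = np.asarray(school)
    x_star = np.empty_like(x)
    for s in np.unique(school):
        m = school == s
        # Least-squares line of Y on X for this school alone.
        slope, intercept = np.polyfit(x[m], y[m], 1)
        x_star[m] = slope * x[m] + intercept
    return x_star

# Hypothetical data: three schools that grade with different leniency.
rng = np.random.default_rng(0)
school = np.repeat([0, 1, 2], 200)
ability = rng.normal(0.0, 1.0, school.size)
leniency = np.array([0.0, 0.8, -0.6])[school]  # assumed school-specific shift
x = 2.5 + 0.5 * ability + leniency + rng.normal(0.0, 0.3, school.size)
y = 2.3 + 0.6 * ability + rng.normal(0.0, 0.4, school.size)

x_star = scale_grades(x, y, school)

# Over-all correlation with college grades before and after scaling.
r_before = np.corrcoef(x, y)[0, 1]
r_after = np.corrcoef(x_star, y)[0, 1]

# Line of regression of Y on X* over all schools taken together.
slope_after, intercept_after = np.polyfit(x_star, y, 1)
print(r_before, r_after, slope_after, intercept_after)
```

In this sketch the line of regression of Y on X* fitted over all schools together has slope 1 and intercept 0, apart from rounding, and the over-all correlation after scaling is at least as large as before.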

It should be instructive to visualize the total scatterplot between 
the predicted college grades and the actual college grades for stu- 
dents from all high schools represented at a single given college. 
Since the numerical value of each X* is the same as the predicted 
or expected mean of the corresponding Ys, the line of regression of
Y on X* for each high school individually will be a 45 degree line
passing through the origin. This line, whose equation is simply
Y = X*, is the common line of regression of Y on X* for all high
schools. To help the reader visualize this, the scatterplot for a given 
high school both before and after scaling is shown in Figure 1. The
solid line oval represents the original scatterplot; the dotted line
oval represents the same scatterplot after the scaling, that is, the
scatterplot for predicted vs actual college grades. Before scaling, the
scatterplots for the various individual high schools will occupy
different overlapping positions on this chart, each with its own line
of regression of Y on X. After scaling, the scatterplots for all high
schools will all be concentrated along the same 45 degree line, each
in the position of “best fit” with reference to that line. The over-all
correlations will therefore be increased, how much depending upon
the extent of overlap in the various original scatterplots.

[Figure 1: scatterplot of Xs and Ys for a given high school, before (solid oval) and after (dotted oval) scaling.]

Figure 1. Showing effect on scatterplot for a given high school of substitut-
ing for each X the scaled value (X*) derived from the line of regression of Y
on X for that school.

$$X^* = Y' = r_{xy}\,\frac{\sigma_y}{\sigma_x}\,(X - M_x) + M_y$$

(To simplify the picture, the scatterplot is shown with an abnormally restricted
variance of Xs and Ys.)

For each high school, the variance of the Ys about this common 
regression line is the same as that about the original regression line. 
Obviously, then, the variance of errors of estimate for the entire 
(scaled) scatterplot is simply the weighted average of the error vari- 
ances for the individual high schools. We see, then, that the accuracy 
of prediction (as measured by the standard error of estimate) for
the college as a whole is necessarily wholly and solely determined 
by the predictive accuracies (with reference to that college) for the 
individual high schools. It is this basic consideration which deter- 
mines the limit of predictive accuracy which can be attained in any 
college by any method of scaling. 
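
The argument of this paragraph can be written out as follows. The symbols are assumptions introduced here for the illustration: $n_k$ is the number of students from high school $k$ at the college, $N = \sum_k n_k$, $s_{e \cdot k}^2$ is the variance of errors of estimate within school $k$, and $s_Y^2$ is the variance of all college grades at that college.

```latex
s_e^2 \;=\; \frac{1}{N}\sum_{k}\sum_{i=1}^{n_k}\left(Y_{ki}-X^*_{ki}\right)^2
      \;=\; \frac{\sum_k n_k\, s_{e\cdot k}^2}{\sum_k n_k},
\qquad
r_{X^*Y}^2 \;=\; 1-\frac{s_e^2}{s_Y^2}.
```

Since $s_Y^2$ is fixed for a given college, the over-all correlation can be raised only by lowering the within-school error variances, and these are unaffected by any linear rescaling of the Xs.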

This school-by-school method involves only one important as- 
sumption—that all within-school-within-college regressions are 
linear. On this quite acceptable assumption, no other internal method 
of linear scaling can possibly result in a higher correlation between 
scaled high school grades and college grades in any individual col-
lege.

It may be of interest that this method works no matter what the 
predictors in the various high schools actually represent or measure,
or how they are scaled originally. The grades in one school may be
expressed in terms of per cents ranging from 70 to 100, in another
they may be expressed on a 10 point scale ranging from 0 to 10, in
another on a 5 point letter grade scale, etc. The variance of errors
of estimate will differ for different schools, but in any case the 
method will maximize the over-all correlation of the predicted and 
actual college grades and will result in comparable “scaled” grades. 

This school-by-school method, then, is theoretically the ne plus
ultra solution to the internal scaling problem. It is very far, however,
from being a good practical solution, either so far as any single
college or any group of colleges is concerned.

What is wanted in the American College Testing Program is a
method that can, with a large scale computer, be applied centrally
for all colleges. Consider the dimensions of the computational task 
in applying the method just described with all ACT colleges. In the 
American College Testing Program this year there are over 800 
participating institutions, for which a total of over 400,000 high 
school seniors, representing at least 7,000 different high schools, will 
be tested. In the ACT program, predicted grades are reported in 
each of four college curriculum areas separately, as well as the over- 
all grade point average. It is an extremely conservative estimate 
that in this population the graduates of the average high school will 
go to at least five different colleges. Even on the basis of this most 
conservative estimate, to employ this method centrally, it would be 
necessary to compute at least 175,000 different regression equations, 
each with its own slope and Y-intercept. It would thus be necessary 
to compute and store electronically at least 350,000 different con- 
stants, to be called up and used as each student’s high school grades 
are being scaled. Even with the largest and fastest of modern com- 
puters, this would be quite an operation! 
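
The arithmetic behind these figures can be retraced as follows. This is a sketch under one assumption not spelled out in the text: that the 175,000 equations arise from one equation per high school, per college attended, per reported grade (four curriculum areas plus the over-all average).

```python
# Figures quoted in the text.
high_schools = 7_000           # "at least 7,000 different high schools"
colleges_per_school = 5        # "at least five different colleges"

# Assumption: one equation per criterion, i.e. four curriculum areas
# plus the over-all grade point average.
criteria = 4 + 1

regression_equations = high_schools * colleges_per_school * criteria
constants_to_store = regression_equations * 2   # a slope and a Y-intercept each
```

With these inputs the counts agree with the 175,000 equations and 350,000 constants given above.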

The impracticability of this method becomes even more apparent 
when we consider that most high schools are represented at most 
colleges by only a very small number of students—in many instances 
by only one, two, or three students. Of the several hundred Iowa 
high schools represented in the entering class of over 2000 freshmen 
at the University of Iowa in 1961-1962, over 60 per cent con- 
tributed only one or two students, and over 80 per cent contributed 
fewer than five each. Note that these figures are for Iowa high schools.
Had out-of-state high schools been included, these percentages 
would be much higher. Results would have to be accumulated for 
many years, therefore, before stable regression equations could be 
established for most high schools for most colleges. By that time, 
many of the equations would no longer be valid for the next crop 
of high school graduates. The best that could be hoped for, therefore,
would be to scale the high school grades for just a few of the
largest feeder high schools with reference to each college, leaving 
unscaled, and hence lacking in comparability, the high school grades
for the majority of the freshmen. What is needed, obviously, is a much 
simpler and more economical method, a method that requires the 
computation and application of only a single conversion formula
for each high school, a formula that may be reliably established on
the data for all of its college-bound students, and applied to the
grades for all such students. 

It would seem obvious that the best single conversion formula to
use with any high school is that which “best fits” the many different
regression equations that would have to be used with that high
school under the optimum method already described. It is quite easy
to show that this “optimum” regression equation for a single high
school is the regression equation of Y on X determined for all colleges
collectively so far as this high school is concerned. In other
words, the practical solution is to scale the grades for each high
school with reference to all colleges collectively, in the same fashion
as for a single college individually under the optimum method already
described. Figure 1 will again help the reader visualize the
procedure. The solid line oval would now represent the original
scatterplot for a given high school in relation to all colleges attended
by its graduates. The dotted line oval would be the scatterplot for
the same high school after scaling. After scaling, the scatterplots for
the individual high schools would again all be “moved” into the
position of “best fit” to the common regression line. Just as the
earlier procedure maximized the correlation between high school
and college grades for a single college, so this procedure would maximize
the over-all correlation between high school and college grades
for all colleges considered together. This method of scaling is that
hereinafter referred to as “Method A.”

Now let us note next that under Method A the over-all correlation 
would be attenuated by the fact that the college grades themselves 
are not comparable in meaning from college to college. An “A” in 
one college may be much harder to earn than an “A” in another,
or the average high school grade of the students earning an “A” in
one college may be much higher than that in another. Presumably,
therefore, the over-all correlation could be improved by first scaling
the college grades with reference to the high school grades in general,
and then scaling the high school grades with reference to the scaled
college grades.

This, then, is the nature of the “Method B” that was employed in
the empirical study. Specifically, a single regression equation of
X on Y was first computed for each college (over all high schools).
The expected value of X, which we will call Y', was then substituted
for each Y in that college. A single regression equation of Y' on X
was then computed for each high school separately. Let us denote as
X" the estimated $Y'_X$ derived from this regression equation. This
X" is then the “scaled” high school grade. This procedure thus first
maximizes the overall correlation of the scaled college grades with
the unscaled high school grades, and then subsequently maximizes
the overall correlation of the scaled high school grades with the
scaled college grades. Again it should be instructive to visualize what
happens to the total scatterplot representing a set of overlapping
scatterplots (one for each college), or another set of scatterplots (one
for each high school), or still another much larger set (one for each
high school in relation to each college). The effect of this double
scaling technique is first to “move” the scatterplots for individual
colleges each into the position of “best fit” with reference to a
common line of regression of X on Y', and then to move the new
scatterplots for individual high schools each into the position of
“best fit” with reference to the common line of regression of Y'
on X".
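
The two methods can be sketched in a few lines. This is an illustration only: the record layout, the field names (`X_prime`, `Y_prime`, `X_double_prime`), the helper `scale_within`, and the toy grades are all assumptions made for the example, not part of the original study.

```python
from collections import defaultdict
from statistics import fmean

def regression(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def scale_within(records, group_key, source, target, out):
    """Within each group, store the regression estimate of `target` from `source`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)
    for members in groups.values():
        slope, intercept = regression(
            [m[source] for m in members], [m[target] for m in members]
        )
        for m in members:
            m[out] = slope * m[source] + intercept

# Hypothetical records: X = high school grade, Y = college grade.
records = [
    {"school": "hs1", "college": "c1", "X": 3.0, "Y": 2.5},
    {"school": "hs1", "college": "c1", "X": 3.5, "Y": 3.2},
    {"school": "hs1", "college": "c2", "X": 2.5, "Y": 2.8},
    {"school": "hs1", "college": "c2", "X": 4.0, "Y": 3.9},
    {"school": "hs2", "college": "c1", "X": 2.0, "Y": 1.5},
    {"school": "hs2", "college": "c1", "X": 3.0, "Y": 2.0},
    {"school": "hs2", "college": "c2", "X": 3.5, "Y": 3.0},
    {"school": "hs2", "college": "c2", "X": 2.5, "Y": 2.6},
]

# Method A: per high school, regression of Y on X over all colleges.
scale_within(records, "school", "X", "Y", "X_prime")

# Method B, first stage: per college, regression of X on Y gives Y'.
scale_within(records, "college", "Y", "X", "Y_prime")
# Method B, second stage: per high school, regression of Y' on X gives X".
scale_within(records, "school", "X", "Y_prime", "X_double_prime")
```

Reusing one helper for both stages makes the symmetry of the double scaling visible: only the grouping variable and the roles of predictor and criterion change.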

It should be noted here that Method B was not expected to repre- 
sent any improvement over Method A so far as the within college 
correlations are concerned. No linear scaling of the grades in a single 
college can be expected to have any direct effect on the correlation 
of those grades with anything else. Method B may be expected, 
however, to improve the average within school correlation, and it 
was for this reason that it was designed and tried out. 
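
The statement that a linear scaling within a single college cannot change that college's correlations is easily verified. A minimal sketch, with hypothetical grades and hypothetical scaling constants:

```python
from math import sqrt
from statistics import fmean

def pearson_r(xs, ys):
    """Product-moment correlation of two equally long lists."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical grades for the students of one college.
hs_gpm = [2.5, 2.7, 3.4, 3.0, 3.9]
college_gpa = [1.8, 2.4, 2.9, 3.1, 3.7]

# Any linear scaling with a positive slope; the constants are illustrative only.
scaled_gpa = [0.63 * y + 1.53 for y in college_gpa]

r_before = pearson_r(hs_gpm, college_gpa)
r_after = pearson_r(hs_gpm, scaled_gpa)
```

The two coefficients agree to within rounding error, which is why any gain from Method B must appear in the within-school rather than the within-college correlations.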

While we know from logical considerations that this method is 
bound to maximize the overall correlation (for all colleges and 
high schools considered together), this maximized overall correla- 
tion is really of no practical interest. Indeed, in the study herein 
reported there was so little interest in overall correlations for all 
high schools and colleges together that they were not even com- 
puted. These overall correlations were maximized only as a means 
to an end, or because presumably the average correlation between 
predictor and criterion variables within individual colleges or within 
individual high schools would thereby be improved. 

The empirical test of this or any scaling method, then, lies in the 
degree to which the within-school and within-college correlations 
with unscaled college grades are improved. 

Scaling of high school grades would not necessarily be worthwhile, 
however, even though it were to result in substantial increases in
the within-college and within-school correlations with the college
GPA criterion. Any optimal solution to the prediction problem is 
certain to involve more than the use of high school grades only. 
More specifically, any optimal solution will involve the use of objec- 
tive and comparable measures of student aptitude and achievement, 
and observations of other student characteristics, in addition to high 
school grades. It is quite possible that when grades are optimally 
combined with other such predictive data, the other data—particu- 
larly scores on aptitude and achievement tests—will measure and 
make due allowances for the differences in ability and achievement 
from school to school that would otherwise be allowed for by the 
scaled grades. That is, the test scores will have much the same effect 
as would adjusting the grades for differences in meaning from school 
to school, and will do this more effectively than will the independent 
scaling of grades as such. The real test of a method of scaling, then,
lies in how much it adds to the predictive value of an optimally
weighted composite of a number of predictors, of which the scaled
high school grade point average may be only one among many. This 
means that the practical efficacy of a method of scaling grades de- 
pends upon the characteristics of the total system of which it is a 
part—in one setting or in one predictive complex, scaling of grades 
might be worthwhile, in another it might not, depending upon the 
extent to which the other predictors serve the same purposes as 
could be served by the independent scaling. 

The study herein reported is concerned with only one of many 
possible complexes of this kind—that employed in the American 
College Testing Program—but it should be highly representative of 
many of the specific complexes now in wide scale use. In the ACT 
Program, eight predictor variables are obtained for each student. 
These will hereinafter be denoted as variables numbered 1-8. Four 
of these are scores on the various tests in the ACT battery: English 
(1), mathematics (2), social studies (3), and natural science (4). 
Each of the other four is the last semester grade earned by the 
student before the end of his junior year in high school in one of the
same four curriculum areas: English (5), mathematics (6), social
studies (7), and natural science (8). The simple average of these
four semester grades for each student will be referred to as his high
school grade point mean (GPM) to help distinguish it from the
usual high school grade point average (GPA) based on the entire
high school record. The predictions of college success provided in
the ACT program are the best weighted linear composites of the 
ACT scores and of the four separate high school grades. Accordingly, 
so far as the ACT program is concerned, the predictive value of any
method of scaling high school grades depends upon the answer to
the question: How does the average within-college multiple correla- 
tion with college grades obtained from ACT test scores plus scaled 
high school grades compare with that obtained from ACT test scores 
plus unscaled grades? 

In the ACT program, separate predictive measures and separate 
criterion measures are obtained for each of four curriculum areas: 
English, mathematics, social studies, and natural sciences. For each
area separately, different ACT test scores, high school grades and
college grade point averages are obtained. This is done with the hope 
that some differentiation in prediction may be achieved among the 
four areas. It is possible, of course, that differences in the meaning 
of high school grades from school to school are not the same for 
each of these four curriculum areas. Between two high schools, for 
example, the grades in English may be more comparable than grades 
in mathematics, or within the same high school grades in the social
studies may be more comparable to those in the natural sciences 
than to those in mathematics. This suggests that it might be desir- 
able to scale the grades separately in each of the four curriculum 
areas—that is, high school grades in English might be scaled with 
reference to the college English criterion, high school grades in 
mathematics with reference to the college mathematics criterion, 
etc. It is possible also that if this is done the multiple correlation 
with an overall college grade point average of the separately scaled 
grades in the four areas may be higher than that of the unscaled 
grades in the same areas. 

Accordingly, the empirical study herein reported was designed to
make possible comparisons among the following correlation coefficients.

1. The typical or median within-college correlation between the
overall college GPA and the unscaled high school GPM ($r_{YX}$).

2. The median within-college correlation between the over-all
college GPA and the high school GPM scaled by Method A ($r_{YX'}$).

3. The median within-college correlation between the over-all
college GPA and the high school GPM scaled by Method B ($r_{YX''}$).

4. The median within-college multiple correlation between the
overall college GPA and the ACT scores plus the unscaled high
school grades ($R_{Y.12345678}$).

5. The median within-college multiple correlation of the overall
college GPA and the ACT scores plus the scaled high school GPM,
scaled by Method B ($R_{Y.1234X''}$).

6. The median within-college multiple correlation between the
overall college GPA and the ACT scores plus the separately scaled
high school grades ($R_{Y.12345''6''7''8''}$).

Comparisons of $r_{YX'}$ and $r_{YX''}$ with $r_{YX}$ will reveal the worthwhileness
of internal scaling of high school grades when grades are the
sole predictors. Comparisons of $R_{Y.1234X''}$ and $R_{Y.12345''6''7''8''}$ with
$R_{Y.12345678}$ will reveal the worthwhileness of scaling in the representative
multiple predictor situation. The latter comparison will
constitute the crucial test.

The study is also designed to make possible a comparison of
(1) the median within-school correlation between the overall college 
GPA and the unscaled high school GPM, and (2) the median 
within-school correlation between the overall college GPA and the 
scaled high school GPM, the scaling being accomplished by Method 
B. 


Procedures Followed in the Empirical Study 


The basic population sampled in this study consisted of 80,572 
students who took the ACT test as high school seniors in 1959-1960 
and who subsequently completed a year of academic work at one of 
the 173 colleges that participated in the 1960-1961 ACT Research 
Service. For each student selected from this population, ACT had 
obtained the eight predictor variables earlier described. Each of the 
173 colleges that participated in the 1961 ACT Research Service 
reported to ACT, for each of its 1960-1961 freshmen who had taken 
the ACT tests, his first year grade point average in each of four
curriculum areas and his overall grade point average. These five
grade point averages will hereinafter be referred to as: E = GPA in
English, M = GPA in mathematics, S = GPA in social studies,
N = GPA in natural science, and Y = the overall GPA based
upon all courses taken by the student in these and other curriculum
areas.

The procedures followed in this study are described step-by-step
in chronological order in the following paragraphs. To insure satis-
factory minimum stability in the results, certain restrictions were 
imposed on the size of subsamples used for individual institutions. 
Specific attention is drawn to these restrictions by asterisks preced- 
ing the paragraph numbers. 

*(1) Only those colleges were used for which complete records 
were available for at least 100 students. A record was regarded as 
“complete” if it contained all 13 predictor and criterion variables 
previously described (1-8, E, M, S, N, and Y). This restriction 
eliminated 113 colleges and over 50,000 students from the basic
population, leaving a sample of 26,659 students representing 60 of
the 173 colleges. 

(2) For each student in the remaining sample a simple mean (X) 
of his four high school grades was computed. This grade point mean 
(GPM) will be used in this study in lieu of the usual overall
grade-point-average, which was not available for this population.
(Considerable evidence is available that predictions of college success
based upon these four semester grades alone closely approach in
accuracy the predictions derived from the general GPA based upon
the complete high school record. For the purposes of evaluating a
method of scaling high school grades, therefore, this GPM should be
just as satisfactory as the usual GPA.) 

(3) For each of the 60 colleges separately, the following measures 
were computed. 


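$$
\begin{array}{lllll}
r_{XY}, & \sigma_X, & \sigma_Y, & M_X, & M_Y \\
r_{5E}, & \sigma_5, & \sigma_E, & M_5, & M_E \\
r_{6M}, & \sigma_6, & \sigma_M, & M_6, & M_M \\
r_{7S}, & \sigma_7, & \sigma_S, & M_7, & M_S \\
r_{8N}, & \sigma_8, & \sigma_N, & M_8, & M_N
\end{array}
$$
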
(4) For each college, using the regression equation for X on Y
unique to that college, a “scaled” college GPA (Y') was computed
for each student as follows:

$$Y' = \text{est'd } X_Y = r_{XY}\,\frac{\sigma_X}{\sigma_Y}\,(Y - M_Y) + M_X.$$


A scaled college grade (E', M', S', or N') was similarly computed
for each curriculum area separately for each student in each college. 

*(5) All student records, each of which now included E', M', S',
N', and Y', were then sorted by high schools. Each high school for
which fewer than ten complete records were available was then eliminated.
The result of this restriction was to reduce the number of
students in the remaining sample to 16,650, and the number of high
schools from several thousand to 608. (The preliminary scaling of
college grades (Step 4) was based on all students for whom com- 
plete records were available, regardless of the size of the high school 
from which the student graduated.) 

(6) For each of the 608 high schools separately, the following 
statistics were computed. 


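$$
\begin{array}{llllllll}
r_{XY}, & r_{XY'}, & \sigma_X, & \sigma_Y, & \sigma_{Y'}, & M_X, & M_Y, & M_{Y'} \\
r_{5E}, & r_{5E'}, & \sigma_5, & \sigma_E, & \sigma_{E'}, & M_5, & M_E, & M_{E'} \\
r_{6M}, & r_{6M'}, & \sigma_6, & \sigma_M, & \sigma_{M'}, & M_6, & M_M, & M_{M'} \\
r_{7S}, & r_{7S'}, & \sigma_7, & \sigma_S, & \sigma_{S'}, & M_7, & M_S, & M_{S'} \\
r_{8N}, & r_{8N'}, & \sigma_8, & \sigma_N, & \sigma_{N'}, & M_8, & M_N, & M_{N'}
\end{array}
$$
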
(7) For each student in each of the 608 high schools, two “scaled” 
high school GPM's were computed, using the equations unique to 
the individual high schools, as follows. 


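$$\text{Method A:}\quad X' = \text{est'd } Y_X = r_{XY}\,\frac{\sigma_Y}{\sigma_X}\,(X - M_X) + M_Y$$

$$\text{Method B:}\quad X'' = \text{est'd } Y'_X = r_{XY'}\,\frac{\sigma_{Y'}}{\sigma_X}\,(X - M_X) + M_{Y'}$$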


Scaled grades were likewise computed for each curriculum area
separately, but using Method B only. That is, for each student,
scaled grades, denoted as 5", 6", 7", and 8" (corresponding to X"),
were computed.

*(8) All records (now including Y', X', X", E', M', S', N', 5",
6", 7", and 8" for each student) were next sorted by colleges. The
earlier elimination of students from small high schools had so reduced
the number of records available from some colleges that stable
regression equations could not be computed for these colleges. Accordingly,
colleges were now eliminated if complete records were not
available for at least 65 students. This third restriction reduced the
total number of students in the remaining sample to 9,364, and the 
number of colleges to 31. 
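
The three size restrictions (Steps 1, 5, and 8) amount to repeated filtering on group counts. The sketch below shows only that filtering, not the intervening scaling steps; the function names, the record layout, and the toy thresholds used in the example call are assumptions made for illustration.

```python
from collections import Counter

def keep_groups(records, key, minimum):
    """Keep only records whose group (by `key`) has at least `minimum` members."""
    counts = Counter(rec[key] for rec in records)
    return [rec for rec in records if counts[rec[key]] >= minimum]

def select_sample(records, min_college=100, min_school=10, min_college_final=65):
    """Apply the three restrictions in order; defaults are the study's thresholds."""
    after_step_1 = keep_groups(records, "college", min_college)
    after_step_5 = keep_groups(after_step_1, "school", min_school)
    after_step_8 = keep_groups(after_step_5, "college", min_college_final)
    return after_step_1, after_step_5, after_step_8

def make(college, school, n):
    """Build n hypothetical complete records for one college and high school."""
    return [{"college": college, "school": school} for _ in range(n)]

# Toy population with deliberately small thresholds so the effect is visible.
toy = (
    make("c1", "hs1", 3) + make("c1", "hs2", 1)
    + make("c2", "hs1", 1) + make("c2", "hs3", 1)
    + make("c3", "hs4", 2) + make("c3", "hs5", 1) + make("c3", "hs6", 1)
)
step_1, step_5, step_8 = select_sample(
    toy, min_college=4, min_school=2, min_college_final=3
)
```

As in the study, a college that survives the first restriction can still be dropped at the third, once its students from small high schools have been removed.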

(9) For each of the 31 colleges, the following coefficients
were computed.

$$r_{YX},\ r_{YX'},\ r_{YX''},\ R_{Y.1234},\ R_{Y.1234X},\ R_{Y.12345678},\ R_{Y.1234X''},\ R_{Y.12345''6''7''8''}$$


TABLE 1

Results of Scaling Illustrated for a Sample of Individual Institutions

[Table body not legible in the source scan.]


Results of the Empirical Study 


To show the effect of the scaling on the grades in individual insti- 
tutions, Table 1 has been prepared. This shows, for a haphazard 
selection of ten colleges and ten high schools, the correlations, means 
and variances of the original grades, and the scaled values of
original GPA's of A (4.0), C (2.0), and F (0.0). The upper half of the 
table shows the results of the preliminary scaling of college grades. 
Thus, for college #1 the scaled value (Y') of an original grade point
average of 4.0 is 4.06, of an original GPA of 2.0 is 2.79, and of an 
original GPA of 0.0 is 1.53. These scaled college grades are regressed 
to the mean high school GPM of the students, which for 45 of the 
60 colleges fell between 2.7 and 3.1. The mean (GPM or X) of the 
four grades of the students is of course less variable than the grades 
in any individual curriculum area, and, as a result of regression 
effects, the scaled college grades are even less variable. Thus the 
scaled college grades have a considerably higher mean and a much 
smaller variability than the original college grades. However, within 
any individual college, the scaled college grades will, of course, show 
a perfect correlation with the unscaled grades.
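
Because the scaling is linear, the three scaled values quoted for college #1 must lie on one straight line. A small check, using only the figures given in the text (the helper name `line_through` is an assumption for this example):

```python
def line_through(p0, p1):
    """Slope and intercept of the straight line through two points."""
    (x0, y0), (x1, y1) = p0, p1
    slope = (y1 - y0) / (x1 - x0)
    return slope, y0 - slope * x0

# College #1: an original GPA of 0.0 scales to 1.53, and 4.0 scales to 4.06.
slope, intercept = line_through((0.0, 1.53), (4.0, 4.06))

# The same line applied to an original GPA of 2.0.
scaled_c = slope * 2.0 + intercept
```

The implied slope is about 0.63, and the value obtained for a GPA of 2.0 agrees with the reported 2.79 to within the rounding of the published figures.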

The lower half of the table similarly shows the results of the 
scaling of the high school grades, both by Method A and by
Method B. The grades scaled by Method A are regressed to the
mean of the college grades (which is nearly always lower than that
of the high school grades). The college grades are also more variable
than the high school grades of the same students. Hence the mean of
the scaled grades (X') is lower than that of the original grades (X) for
each high school, while $\sigma_{X'}$ is usually larger than $\sigma_X$.

Since the grades scaled by Method B were regressed to the mean 
of the scaled college grades, they show a considerably smaller vari- 
ance than the actual high school grades. It will be noted that the 
scale value (X") of a high school grade of 0.0 is usually about 2.0, 
while that of an original grade of 4.0 is usually below 4.0. This is of 
no consequence, since no interpretation is placed upon the absolute
value of the scaled high school grade. (By means of a simple linear 
transformation, the scaled grades could be given any mean and any 
variance desired.) Within any given high school, grades scaled by
any linear conversion will of course show a perfect correlation with 
the original high school grades. 


TABLE 2

Summary of Results for Individual Colleges

[Table body not legible in the source scan.]


It is evident that the scaling resulted in quite substantial adjust- 
ments in the numerical values of the grades for different high 
schools. It is surprising that these adjustments had so little effect 
upon the correlations of these scaled high school grades with the 
college GPA criterion. 

The percentiles reported at the foot of each column in the upper 
and lower halves of Table 1 are computed, not only for the hap- 
hazard sample of ten institutions, but for all (60) colleges involved 
in the preliminary scaling of college grades, and for all (608) high 
schools involved in the scaling of high school grades. 

The preliminary scaling of college grades by Method B was not 
expected to improve the correlation of scaled high school grades with 
the original college grades. But it was expected that the scaled 
college grades would show a higher correlation with the original high 
school grades than did the unscaled college grades. This proved 
barely to be true. It will be noted that the median correlations of
original high school grades with unscaled and scaled college grades
were $r_{XY}$ = .621 and $r_{XY'}$ = .629. Thus, the improvement in the
medians was practically negligible, only .008. The 90th and 10th
percentiles in the distribution of these correlation coefficients were
almost identical.

The mass of “output” data obtained from the electronic computer 
used in this study was so voluminous that it is impracticable to
report here more than the most important of the summary results
obtained. The results of most of the intermediate comparisons were
never even printed out, and the record of them exists on magnetic
tape only (but is available in this form to interested research
workers). The critical results for the 31 colleges are presented in
Table 2. 

One of the comparisons of major interest from Table 2 is that
between $r_{YX'}$ and $r_{YX}$, the within-college correlations of scaled and
unscaled high school grade point means with the overall college
grade point average, the scaling being done by Method A. For this
sample of 31 colleges, the median within-college $r_{YX'}$ was .638
(column 8) and the median $r_{YX}$ was .604 (column 7), with a difference
of .034. The median difference* between $r_{YX'}$ and $r_{YX}$ is .047.
[A clause concerning the standard error of this difference is illegible
in the source.] Its 95 per cent confidence interval is roughly .020 to .094.
There is probably a real improvement due to the scaling by Method
A, but one much smaller than was expected.

* Note that the difference in the medians for all colleges is not the same as
the median of the differences for individual colleges.
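
The distinction between a difference in medians and a median of differences is worth a small illustration. The correlations below are hypothetical, chosen only to make the two quantities differ; they are not values from Table 2.

```python
from statistics import median

# Hypothetical within-college correlations for three colleges.
r_unscaled = [0.50, 0.60, 0.70]
r_scaled = [0.65, 0.62, 0.71]

# Difference between the two medians, each taken over all colleges.
difference_of_medians = median(r_scaled) - median(r_unscaled)

# Median of the college-by-college differences.
differences = [s - u for s, u in zip(r_scaled, r_unscaled)]
median_of_differences = median(differences)
```

Here the difference of the medians is about .05 while the median of the differences is about .02, so the two summaries need not agree even for the same set of colleges.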

No appreciable difference in the median within-college correlations
($r_{YX'}$ and $r_{YX''}$) for Methods A and B was expected, and none
was found. The median coefficient for Method B was $r_{YX''}$ = .645
(column 9), which exceeded that from Method A by just .007. The
median difference between $r_{YX'}$ and $r_{YX''}$ is −.001. The standard
error of this difference is approximately .006. This difference, even
though real, is obviously too small to be of any practical consequence.

As explained earlier, the real test of the methods of scaling used 
in this study lies in the increments they add to the correlations be- 
tween the college GPA and the best weighted composite of grades 
and ACT test scores. For this sample, the median within-college 
multiple R between the college grade point average and ACT test 
scores plus unscaled high school grades was $R_{Y.12345678}$ = .681
(column 13). By combining ACT test scores with scaled (Method B)
high school grade point means, a median multiple $R_{Y.1234X''}$ = .691
(column 12) was obtained. By combining ACT scores with grades
separately scaled in the four curriculum areas, a median coefficient
of $R_{Y.12345''6''7''8''}$ = .693 (column 14) was obtained. The larger of
the increments is .012 (.693 − .681 = .012) and the median difference
between these two correlations is .013 (confidence interval
roughly .003 to .022). Independent scaling of high school grades
does appear, then, to result in a real improvement in the multiple
correlation of scores and high school grades with the overall college 
grade point average for this population, but an improvement that 
is extremely small in consideration of the tremendous amount of 
work required to obtain it. 


Interpretation of Results 


Needless to say, these results were most disappointing, particu- 
larly in consideration of the very considerable effort and expense 
that the study involved, as well as in view of the expectations 
raised by the results of previous studies, particularly that by 
Bloom and Peters. In general, the results of this study, together 
with independent logical considerations, seem to lead inescapably to 
the conclusion that internal scaling of high school grades is not a
promising way of improving the prediction of college grades. This
seems most evident so far as the ACT population of high schools and
colleges is concerned, but it is probably true of other large popula-
tions of schools and colleges as well. 

An effort is made below to suggest an explanation of the unex-
pectedly small improvement in the predictive value of scaled high 
school grades resulting from internal scaling. 

1. As already noted, the upper limit of the predictive accuracy 
of internally scaled college grades (as measured by the standard 
error of estimate) is wholly and solely determined by the weighted 
average of the standard errors of estimate for students attending a
single college and a single high school. This is equivalent to saying
that the upper limit of the predictive validity (as measured by cor-
relation coefficients) that can be obtained through scaling is deter-
mined in part by the average of the correlations between high
school and college grades for students attending a single college
and a single high school, and in part by the differences among high
schools in the grading standards employed. It should be noted that
no method of scaling either college or high school grades can have
any effect upon these within-school-within-college correlations. Un-
fortunately, we do not have any very good information about the
distribution or level of these correlations for any large or representa-
tive sample of pairs of high schools and colleges, but it is very much
to be doubted that the average value of these correlations would
much exceed .60 for any large population. If this is true, unless
quite large differences in ability level prevail among the high
schools, the average within-college correlation of internally scaled
school grades with college grades can probably never much ex-
ceed .70.

2. The differences in the meaning of grades from high school to 
high school, so far as college prediction is concerned, may be smaller 
than seems to have been generally believed. Most public high schools 
are community high schools, and are essentially non-selective in
character. The level and range of intelligence or aptitude are proba-
bly much the same from community to community and hence from
high school to high school. It is, of course, true that many suburban
high schools, or high schools in restricted and highly homogeneous
privileged or underprivileged areas in large school systems, do show
large deviations from the general norm in level of ability, and even 
larger deviations in levels of achievement. Surely, also, there are 
many highly selective private independent secondary schools. But, 
for the great mass of American high schools, the school-to-school
differences in ability to earn college grades may be fairly small. 
(This may be especially true of the high schools in the states most 
heavily represented in the ACT study.) Accordingly, for most high 
schools (and particularly for most high schools in the ACT popula-
tion), internal scaling methods can do little to make the grades more
comparable for the purposes of predicting college success. 

What has just been said about the mass of American public com-
prehensive high schools applies almost equally to the great mass of
American colleges and universities. By far the great majority of 
institutions—the great state universities, the municipal universities 
and junior colleges (which are rapidly constituting a larger and 
larger proportion of the total number), the overwhelming majority 
of private and denominational colleges—are definitely non-selective 
in character, and are not characterized by large inter-institutional 
differences in level of student ability. It should not be too surprising, 
therefore, that the attempt at preliminary scaling of college grades 
had no overall effect on the within-school correlations for the ACT 
population. 

3. Any internal method of scaling high school grades renders 
strictly comparable grades from different high schools only if the 
groups of colleges associated with the various individual high schools 
are on the average identical in the meaning and distribution of their 
grades. Otherwise the grades of the various individual high schools 
will be scaled with reference to different standards. Unfortunately
for the internal method, most students in a specific high school 
typically attend only a small number of colleges, which are usually 
determined largely by factors of proximity. Most colleges serve pri- 
marily a rather highly localized set of high schools. This, of course, 
is particularly true of municipal universities and junior colleges, but 
it is true also on a larger scale of most state universities and state
colleges, as well as of the great majority of private and denomina- 
tional colleges. In general, it is only for a relatively small number 
or proportion of “name” colleges that the majority of students are 
drawn on a selective basis from high schools and preparatory schools 
scattered over the entire country, or over a large geographical 
region.

Suppose that two widely separated high schools are actually alike
in level of student aptitude and achievement and in grading stand-
ards, but that a large proportion of the students in one high school
attend a college in which high grades are easy to attain, while most
of those in the other high school attend a college in which high
grades are difficult to earn. Obviously, the internal method of scaling
may render grades less comparable rather than more comparable in
such instances. An effort was made to overcome this difficulty in this
study, as in the Bloom and Peters study, by a preliminary scaling of
the college grades, followed by a scaling of the high school grades on
the basis of the scaled college grades. This, however, is in large
part a “bootstrap” type of operation. For example, if Mississippi
colleges are preliminarily scaled with reference to Mississippi high
schools, and if Mississippi high schools are scaled with reference to
the scaled grades for Mississippi colleges, and if the same is true,
for example, in Minnesota, the internal method would hardly render
grades comparable from state to state, either for high schools or
colleges. It is probable that it is this “localization” factor which most
accounts for the failure of the internal scaling technique to improve
within-school and within-college correlations in this study.

4. It is quite likely that the differences in grading standards that
most attenuate the correlations between school and college grades
are those which exist within the individual schools and colleges,
rather than among them. The differences in these standards that
exist among different instructors in the same department, or from
department to department in the same high school, or particularly
from “track” to “track” in the same high school, are perhaps just as
large as those from one high school to another. To identify and
allow for these differences is a still more difficult problem, and one
that is much farther from being solved. Similar observations, of
course, may be made about differences in grading standards in the same
college or university. An effort was made to do something about this
in the present study by scaling college and high school grades sepa-
rately for different curriculum areas, but again with disappointing
results.

The preceding discussion may help explain why greater improvement in the
prediction of college grades was not attained in the present study.
Even though considerably larger improvement had been
obtained, one might still contend that the internal method of scaling
is not a practical way of improving the prediction of college grades.
In reaching this conclusion, two factors are of special importance:

1. In any system for predicting college grades, it is obviously
desirable that whatever predictions are employed should be avail- 
able and comparable for all or nearly all students at each institu- 
tion. To provide predictive indices for a part of the student body 
only, or to provide one index for one part of the student body and 
other (non-comparable) indices for other portions of the student 
body, would surely result in inconvenience and confusion. The fact 
that in this study the scaled grades (or predictive indices) could be 
provided for only 9,364 of the original sample of over 80,000 students 
is in itself highly significant. These attrition effects have already
been discussed above at some length and need not be reconsidered
here. For most populations of colleges, the attrition resulting from 
the restrictions upon sample size (for individual institutions) nec- 
essary to secure reasonable stability in the results obtained, is alone 
sufficient to make internal scaling methods impracticable. 

2. Whatever the differences that do exist in ability and aptitude 
level among a relatively small proportion of secondary schools, there 
are other and perhaps better ways of revealing these differences and 
of taking them into consideration in prediction than internal scaling 
of high school grades. Specifically, these differences may be measured 
directly with objective and reliable achievement and aptitude tests, 
or through other observations of student and school characteristics 
related to college achievement. One possibility is to scale the high 
school grades for individual high schools on the basis of school and 
college differences revealed by the tests, then to use the resulting 
scaled high school grades as the sole predictors of college success.
Any such “external” method, however, will probably suffer just as 
severely from attrition effects (resulting from restrictions on sample 
size needed to insure stability) as the internal method. It would 
seem, therefore, that the best way of using the test scores (and other 
observations) is simply to employ them along with the high school 
grades as independent predictors in multiple regression equations 
established for all high schools considered collectively, regardless 
of the number of students coming from each high school. Each stu- 
dent's test score (or other measure) might be regarded as consisting 
of a constant “school effect” plus a variable “individual” effect. The
multiple regression equation will perhaps give close to optimum 
weights in prediction to the constant school factors as well as to the 
variable individual factors, without the need for any isolation or
identification of the school factors as such. It is perhaps for this
reason that the scaling of high school grades added so little to the 
multiple correlations with college grades in this study. 
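
The multiple-regression alternative can be illustrated with the standard two-predictor formula for a multiple correlation. The sketch below is an illustration only: the function name `multiple_r` and the three input correlations are hypothetical values chosen for the example, not figures from this study.

```python
import math


def multiple_r(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    numerator = r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12
    return math.sqrt(numerator / (1.0 - r_12 ** 2))


if __name__ == "__main__":
    # Hypothetical inputs, not results from this study.
    tests_only = 0.55   # test composite vs. college grades
    grades_only = 0.50  # unscaled school grades vs. college grades
    overlap = 0.45      # test composite vs. unscaled school grades
    combined = multiple_r(tests_only, grades_only, overlap)
    print(f"tests alone {tests_only:.3f}, grades alone {grades_only:.3f}, "
          f"combined {combined:.3f}")
```

With these assumed inputs the combined correlation exceeds either predictor alone, because the regression weights the part of the test score that the unscaled grades do not already carry; no school-by-school scaling step is involved.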

Attention has already been drawn to the fact that the improve-
ments in prediction obtained through scaling in this study were
disappointingly small in comparison with those reported by Bloom 
and Peters. For a sample of 23 secondary schools, Bloom and
Peters reported a median within-school correlation for unscaled 
school and college grades of .54, and for scaled school and college 
grades of .77—an improvement of .23. In the present study, the 
improvement in medians for a sample of 608 schools was only .008. 
Bloom and Peters reported also a median within-college correlation
for unscaled school and college grades of .57, and for scaled school 
and college grades of .68—an improvement of .11. This was for a 
sample of 13 colleges, with an average of 96 students per college. 
In the present study the median improvement was only .041 for 31 
colleges with an average of over 300 students per college. It is ap- 
parent from the preceding figures that the results reported by Bloom 
and Peters are considerably less stable than those reported in the 
present study, and it is possible that some of the differences between 
the two studies may be accounted for by sampling fluctuations. How- 
ever, the differences are probably best accounted for by the charac- 
teristics of the schools and colleges employed in the respective 
samples. 
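
One rough way to gauge the relative stability of the two sets of results is the Fisher z approximation to the sampling error of a correlation. The sketch below is a simplification: it treats a single college of average size as one random sample, whereas the figures reported above are medians over colleges. The function name `fisher_interval` is hypothetical, the 95 per cent level is an assumption, and the same correlation of .68 is used for both sample sizes purely to isolate the effect of the number of students.

```python
import math


def fisher_interval(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a correlation r based on n cases."""
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = math.atanh(r)                     # Fisher z transform of r
    half_width = z_crit / math.sqrt(n - 3)  # z_crit times the standard error of z
    return math.tanh(z - half_width), math.tanh(z + half_width)


if __name__ == "__main__":
    for n in (96, 300):  # average students per college in the two studies
        low, high = fisher_interval(0.68, n)
        print(f"n = {n}: interval width = {high - low:.3f}")
```

The interval for 96 students is wider than that for 300, which is the sense in which results from the smaller samples are less stable.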

It has already been noted that the upper limit of predictive 
validity that may be obtained through internal scaling in any given 
college is set by the average within-school-within-college correlation 
of school and college grades. Only if this average correlation is high, 
and if large differences in grading standards exist from school to 
school, can any marked improvement possibly result from the use 
of any internal scaling method. Now it is very significant that Bloom 
and Peters drew their data from the records of the National Regis- 
tration Office for Secondary Schools, whose services are employed 
almost exclusively by private independent secondary schools, most 
of whose graduates go on to a Liberal Arts program in one of a
relatively small number of rather highly selective colleges or uni-
versities. Most of these secondary schools are quite narrowly col-
lege preparatory in character, with heavy emphasis upon the same
subjects—mathematics, science, foreign language, etc.—that their
students later pursue in college. The proportion of graduates going
on to college from the private secondary schools is very much higher
than from the public schools, resulting in a much wider spread of
school grades among their college-bound students. (In most of the
secondary schools employed in the Bloom and Peters study, practi-
cally all of the graduates enter some college or university. In the typi-
cal American secondary school, however, not very much more than
half of the graduates go on to college.) It is only to be expected, there-
fore, that within-school-within-college correlations for the Bloom
and Peters sample would be relatively high. It is quite plausible, for
example, that the correlation between grades in Phillips Academy
and grades at Harvard or Dartmouth is considerably higher than
between grades in a Davenport, Iowa, high school and grades at the
State University of Iowa, considering particularly the greater di-
versity in the college curricula pursued by the public school gradu-
ates.
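
The ceiling argument (a within-school-within-college correlation of about .60 implying a within-college correlation of at most about .70 after scaling) can be illustrated by splitting each covariance into within-school and between-school parts. The sketch below rests on assumptions not stated in the article: grades are standardized within a college, the within-school correlation is the same in every school, and ideal scaling makes the school means of the two variables correlate perfectly (`r_between = 1.0`). The function name and the variance shares are hypothetical.

```python
import math


def pooled_correlation(r_within: float, r_between: float,
                       share_x: float, share_y: float) -> float:
    """Total correlation built from within-school and between-school parts.

    share_x, share_y: fraction of each variable's variance that lies
    between schools (0 means all schools have the same mean).
    """
    within_part = r_within * math.sqrt((1.0 - share_x) * (1.0 - share_y))
    between_part = r_between * math.sqrt(share_x * share_y)
    return within_part + between_part


if __name__ == "__main__":
    # Assumed within-school correlation of .60 and ideal scaling.
    for share in (0.0, 0.10, 0.25, 0.40):
        ceiling = pooled_correlation(0.60, 1.0, share, share)
        print(f"between-school share {share:.2f}: ceiling r = {ceiling:.2f}")
```

Under these assumptions the ceiling reaches .70 only when a quarter of the variance in grades lies between schools, which is consistent with the remark that large school-to-school differences in ability are needed before scaling can help much.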

As to the differences in grading standards and levels of ability 
from school to school and from college to college, these must have 
been considerably larger for the Bloom and Peters sample than for 
the sample employed in the present study. It is significant that 
Bloom and Peters were able to classify their colleges into three 
groups, in one of which the high school grades earned by the students
were considerably higher than those they later earned in college, in 
one of which the high school grades were considerably lower than 
those later earned in college, with an intermediate group in which 
the school and college grades were at about the same level. In the 
present study, in only 10 out of 608 high schools was the average of 
the high school grades lower than the average of the later college 
grades, and then only by a completely negligible amount. The
specific institutions of higher education involved in the Bloom and
Peters study are not reported by them, but considering the source of
their data, it seems likely that their sample must have included the 
country’s most highly selective institutions, as well as some that are 
very much less selective. In other words, the variance in level of 
student ability among the colleges used in their sample was probably 
considerably larger than that among colleges in general, or than
among the colleges used in the present study. 

It seems likely, therefore, that the sample available to Bloom and 
Peters happened to be one which was particularly “favorable” to the 
use of internal scaling methods. It is possible that if they had used
a much larger, more stable and more representative sample of 
schools and colleges the improvement due to scaling would have 
been markedly reduced. No specific data on attrition effects are 
reported in the Bloom and Peters study, but the size of samples
from which final results were reported suggests that these effects
were severe. It is quite likely, therefore, that much of the improve- 
ment that was obtained could have been more economically and 
more efficiently obtained, and for a much larger proportion of stu- 
dents, by combining test scores and other data with unscaled grades 
in multiple regression equations. 

While the ACT sample may be more representative of the entire 
population of American schools and colleges than the Bloom and 
Peters sample, neither is free from bias. The ACT sample includes 
very few of the nation's most highly selective colleges—particularly 
of those on the Eastern seaboard—and the Bloom and Peters sample 
contains an unduly high proportion of them. For a sample truly 
representative of all American schools and colleges, the improvement 
in prediction resulting from internal scaling would very probably lie
somewhere between the results reported in these two studies. How- 
ever that may be, two other considerations are of overriding im- 
portance: (1) the high rate of attrition characterizing the use of any 
school-by-school scaling method, particularly with reference to in-
dividual colleges, and (2) the possibility of achieving equally good
results more economically and without attrition through the use of
multiple regression techniques in which test scores and other meas-
ures are combined with unscaled grades. Everything considered,
then, so far as practicability is concerned, the case for scaling high 
school grades is still very far from having been established. 


Author's Note 


This study illustrates a type of research which is today made 
possible by the large-scale electronic computer, but which, because
of the magnitude of the computational and data-processing problems 
involved, would have been regarded as utterly impracticable just a 
few years ago. The study was conducted for the American College 
Testing Program at Measurement Research Center, Inc., Iowa City, 
Iowa. The essential components in the equipment used consisted of 
a tape-oriented IBM 7070-1401 computer complex, an MRC elec-
tronic document scanner, and several MRC electronic test scoring
machines. The input data were recorded and coded in pencil by 
students and college personnel on answer sheets and report forms 
from which the data were automatically transcribed at high speed 
to punched cards and magnetic tape, without the intervention of 
manual key punching. (MRC’s equipment is now all of the docu- 
ment-to-tape type, with punched cards used only for correction 
routines and for program control.) Equally essential to such research 
is a wide scale cooperative organization for data collection—in this 
case, the American College Testing Program—through which data 
may be collected on a strictly comparable basis from a large popula-
tion of schools and colleges. Such facilities should result in the
future in many significant studies of institutional characteristics
and their inter-relationships, and other studies in which the institu-
tion, rather than the individual, is the basic unit in large scale
sampling.


REFERENCE 


Bloom, B. S. and Peters, F. R. The Use of Academic Prediction
Scales for Counseling and Selecting College Entrants. New York: 
Free Press of Glencoe, 1961. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 4, 1963


FACTOR ANALYSES OF PHYSICAL FITNESS TESTS¹


EDWIN A. FLEISHMAN²
American Institute for Research 


In a previous article in this journal (Nicks & Fleishman, 1962) 
we reviewed earlier factor analysis work in the area of physical 
fitness measurement. The review described 14 physical proficiency 
factors previously identified, discussed other possible factors which 
might be discovered, and raised a number of questions regarding 
the structure of skill in this area. It was possible to integrate the 
factors into a meaningful scheme and to compile a comprehensive 
catalogue of tests according to the factors they seemed to measure. 
A summary conclusion was that commonly used fitness test bat- 
teries do not cover the range of possible fitness factors and many 
of the tests which are used overlap with one another in the factors 
measured. We also suggested several large scale follow-up studies 
needed to clarify factor definitions and to answer questions raised 
by the review. The present article summarizes two of the large 
scale follow-up studies.


Approach 


The literature review served as a basis for test development and
selection for the factor analysis studies. The plan was to divide the
diversity of possible tests into two separate studies. The first in-
volved tests in the area of strength measurement, which includes a
large proportion of existing fitness tests and seemed to yield the
most consistent set of factors in previous research. The second
study included ability areas which minimized strength but empha-
sized such features as speed, flexibility, balance, and coordination.
While considerable previous work had been done in the strength
area, these latter areas were not well defined. Each study involved
considerable pre-testing before at least three tests were included to
represent each hypothesized factor.

¹ The research reported here was supported under Office of Naval Research
Contract No. 609 (32) while the author was at Yale University. The article is
a highly condensed summary of two technical reports (Fleishman, Kremer, &
Shoup, 1961; and Fleishman, Thomas, & Munroe, 1961). For details on pro-
cedure, pictures of the tests, summary statistics, correlation and unrotated
factor matrices, and more detailed discussion, the interested reader is referred
to these reports and to the two other technical reports comprising this series
(Nicks & Fleishman, 1960, and Fleishman, 1962).

² The article was written during the author's year as a Guggenheim Fellow.
The author is indebted to the John Simon Guggenheim Foundation and to
Yale University which provided a Senior Faculty Fellowship during this year.

Each study was designed to (1) clarify the generality and limits
of factors, in these respective areas, in a wide range of tasks, (2) 
sharpen the definitions of the factors, and (3) discover which tests 
provide the best measures of the factors identified. Each study 
had specific objectives as well. We will summarize these studies 
in turn. 


Analysis of Strength Tests 


A testing team was established at the U.S. Naval Training Cen- 
ter, Great Lakes, Illinois. Three research assistants from Yale, 
assisted by 12 senior Navy petty officers, administered the 30 tests 
included in the “strength battery” to 201 Navy recruits. The 
average age was 18 yrs 3 mos (S.D. = 1 yr, 3 mos); average
weight was 150.6 lbs (S.D. = 20.3); average height was 5'10"
(S.D. = 2.8"). Administrative order of the tests was determined
from joint considerations of a) fatigue effects, b) the number of
examiners, c) traffic flow from groups to individually administered
tests and from indoor to outdoor facilities, and d) the number and
sequence of hours we could arrange in the regular basic training
schedule. Each subject was tested in three sessions of approximately
2½ hours each.

Table 1 summarizes the study design, shows the range of tests, 
and the factors they were hypothesized to measure. Familiar tests 
were used when possible to throw light on their factor structure. 
Efforts were made to allow for the possible appearance of additional 
factors. Systematic test variations were introduced. For example, 
were there separate strength factors confined just to legs, arms, or
trunk muscles? Did factors correspond to arm extensor or arm flexor
muscles? The role of “endurance” in strength tests was evaluated
by comparing tests requiring subjects to hold static positions of
strain for prolonged periods (see e.g. “Bent arm hang,” “Hold half
push-up”), with other tests which required continual exercise of these
arm muscles for as long as possible (Pull-ups and Push-ups). Still
other tests attempted to rule out “endurance” altogether by invok-
ing short time limits (e.g. “Do as many pull-ups as possible in 20
sec.”). Would one or more separate endurance factors emerge?

The inclusion of sprint and run tests was aimed at clarifying the 
role of endurance and strength factors in such performances. 


Brief Test Descriptions 


1. Leg Lifts. While lying on one's back raise the legs to a vertical 
position as many times as possible in 20 seconds. 

2. Push-Ups. Do as many as possible in 15 seconds.

3. Reverse Sit-Ups. Lying prone, hands behind neck, raise upper 
half of body as many times as possible in 20 seconds. 

4. Deep-Knee Bends. From an erect position, lower to a squat, 
rise, repeat as many times as possible in 30 seconds. 

5. Sit Ups. Do as many as possible in 30 seconds. 

6. Squat Thrusts. From a “push-up” position, arms outstretched, 
jump the legs under the body and extend them as many times as 
possible in 30 seconds. 

7. Pull Weights—Arms. While lying face down on a bench, pull a 
37 lb. barbell up to the bench as many times as possible in 20
seconds. 

8. Hand Grip. Squeeze a Narragansett Co. grip dynamometer as 
hard as possible. 

9. Push Weights—Arms. Lying on back, press a 37 lb. barbell 
away from chest, as many times as possible in 20 seconds. 

10. Arm-Pull—Dynamometer. Pull against a dynamometer mounted 
on a pole. 

11. Push Weights—Feet. The subject wears “iron boots” into which 
a small 17 lb. barbell is inserted. While on his back with his 
knees drawn toward his chest, he pushes up and lowers these 
weights as many times as possible in 20 seconds. 

12. Trunk Pull—Dynamometer. While seated, pull forward as far
as possible against a strap fastened to a dynamometer. 

13. Rope-Climb—Time Limit. Climb as high as possible in 6 sec. 
using only hands. 


14. Dips—Time Limit. Using only arms, lower and raise the body
between parallel bars as many times as possible in 10 seconds.

15. Vertical Jump. Jump as high as possible with hands at side.

16. Dips—to Limit. Do as many as possible.

17. Standing Broad Jump.

18. Leg Raiser—Timed. Lying on back, lift legs approximately …
inches off floor and hold as long as possible.

19. 10 Yard Dash—Timed.

20. Bent Arm Hang—Timed. Hold self up on chinning bar, with
eyebrows level with bar, for as long as possible.

21. 50 Yard Dash—Timed.

22. Pull-Ups—to Limit. Do as many as possible.

23. Shuttle Run—Timed. Run back and forth, 5 times, over a 20
yard distance (total 100 yds.).

24. Pull-Ups—Time Limit. Do as many as possible in 20 seconds.

25. Medicine Ball Put—Standing. Using one hand, throw, as far as
possible, a 9 lb. medicine ball, without moving feet.

26. Hold Half Sit-Up—Timed. With hands behind neck, from a
supine position, hold back rigid (at approximately 40° angle
with floor) for as long as possible.

27. Medicine Ball Put—Sitting. While sitting, propel a 9 lb. medi-
cine ball as far as possible away from body (using only arms).

28. Hold Half Push-Up—Timed. From a “push-up” position, …
up so that 90° angle is maintained at elbow; hold this position
as long as possible.

29. Softball Throw. Without moving feet, throw as far as possible.

30. Push-Ups—to Limit. Do as many as possible.


TABLE 1
Possible Strength Factors Hypothesized in the Experimental Tests
[Table body illegible in the source.]


Results 


Scores on these 30 test variables together with 11 supplementary
background variables were intercorrelated and factor analyzed by
the centroid method. Rotation to a simple structure criterion was
accomplished using Kaiser’s Varimax analytical solution. Factor
extraction and rotation were carried out using programs for the IBM
650 computer. Table 2 presents the rotated factor matrix.
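
Assuming the final column of Table 2 is the communality (h²) of each variable, it can be checked against the loadings: with orthogonal factors, such as a Varimax rotation yields, a communality is the sum of the squared loadings in a row. The sketch below uses the Hand Grip and Trunk Pull rows as printed in Table 2; the function name `communality` is hypothetical.

```python
def communality(loadings: list[float]) -> float:
    """Sum of squared loadings of one variable on orthogonal factors."""
    return sum(a * a for a in loadings)


# Loadings on Factors I to VII, with the decimal points restored.
hand_grip = [0.09, 0.72, 0.03, -0.09, 0.05, 0.06, -0.09]
trunk_pull = [-0.13, 0.59, 0.06, 0.13, 0.04, 0.06, 0.02]

if __name__ == "__main__":
    print(round(communality(hand_grip), 2))   # 0.55; Table 2 lists 55
    print(round(communality(trunk_pull), 2))  # 0.39; Table 2 lists 39
```

The same check offers a way to test a doubtful entry in a row: a set of loadings whose squares sum to more than 1.00 cannot all be correct.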


TABLE 2
Rotated Factor Loadings (Great Lakes Study)*

                                            Factors**
                                  I   II  III   IV    V   VI  VII
Variable                         DS   SS   ES   TS   WB  AEG  AES   h²
1. Leg Lifts                     82   18   28    …    …    …    …    …
2. Push-ups (in 15 sec.)         68   15   23   22  -04   05  -09   60
3. Reverse Sit-ups               04    …   26   06   20   20    …   18
4. Deep Kneebends                25  -08   25   21   25   04  -21   28
5. Sit-ups                       31   05    …   23   08  -02   15   28
6. Squat Thrusts                 45   11   11   14   00   40   14   27
7. Pull Weights—Arms             11   83    …   08   50    …    …    …
8. Hand Grip                     09   72   03  -09   05   06  -09   55
9. Push Weights—Arms             38   51   21   11   44   12  -06   67
10. Arm Pull, Dyna.              16   71   03  -02  -04   08   03   54
11. Push Weights—Feet            25   35   08   28   49    …    …    …
12. Trunk Pull, Dyna.           -13   59   06   13   04   06   02   39
13. Rope Climb                   67  -03   41    …   06   06   20   66
14. Dips (in 10 sec.)            70   00   33   16   08   12    …    …
15. Vertical Jump                30   18   64  -02  -01   22  -06   58
16. Dips (to limit)              68   05   27   17    …    …    …    …
17. Standing Broad Jump          35   15   66    …    …    …    …   60
18. Leg Raiser                   35  -10  -02   43   12   00  -02   33
19. 10 Yard Dash                 28   07   70   12  -01   10  -01   59
20. Bent Arm Hang                73  -06   16   18   12   03   08   61
21. 50 Yard Dash                 44   07   75   02  -05   20   03   80
22. Pull Ups (to limit)          81  -05   29  -07   10  -03  -04   76
23. Shuttle Run                   …  -04   77    …    …    …    …    …
24. Pull Ups (in 20 sec.)        78   04   40  -10   04   00   02   79
25. Medic. Ball Put (stand.)     09   71    …   06   01   09  -04   59
26. Hold Half Sit-up             30   09   13   45  -18  -03   05   35
27. Medic. Ball Put (sit.)       02   44   26   11   22   02   20   36
28. Hold Half Push-up            68   05   08   12   07  -05   21   54
29. Softball Throw               09   32   54   25  -06   13   10   49
30. Push-ups (to limit)          74   14    …   17   07   04  -17    …
31. Height                      -39   42  -21  -31  -15  -04   32   59
32. Weight                      -43   70  -23  -08    …  -04    …   81
33. Age                         -04   09  -06  -06   28   06   10   11
34. Gen. Classif. Test           01  -08  -16  -08   29   03  -05   13
35. Athletic Exper. Scale       -03   05   13  -01   11   89   32   92
36. Athletic Versat. Index       15   13   08  -06   01   45   64   65
37. Football Experience         -10   14   10  -09   19   45   28   36
38. Basketball Experience       -04  -05   10   27   05   51   38    …
39. Baseball Experience         -05   04   07   29   07   27   48    …
40. Track (Run.) Experience      02   07   17  -05  -04    …   09   48
41. Track (Field) Experience     10   13  -01  -06  -03   63   02   43

* Factor loadings rounded to two places and decimals omitted.
** Factors are identified as follows: I, Dynamic Strength; II, Static Strength; III, Explosive
Strength; IV, Trunk Strength; V, Weight Balance; VI, Athletic Experience—General; VII,
Athletic Experience—Specific.


Factor Interpretations


Factor I is defined as Dynamic Strength, the ability to exert
muscular force repeatedly or continuously over time. It represents
muscular endurance and emphasizes the resistance of the muscles
to fatigue. The common emphasis of tests loaded on this factor is
on the power of the muscles to propel, support, or move the body
repeatedly, or to support it for prolonged periods. The best measures
of this factor involve arms, but the factor extends to tests involving
legs (e.g., runs) and trunk muscles (e.g., sit-ups, leg lifts). The
factor is common to “endurance” and “time limit” strength tests,
and no separate “flexor” and “extensor” factors appeared. The nega-
tive loadings of weight and height indicate that subjects with more
body mass are less likely to score high on this factor.


Factor II … force against either a dynamometer, a relatively heavy weight, or
some fairly immovable object. The factor is defined as Static
Strength, and represents the maximum force a subject can exert,
even for a brief period. In contrast to other strength factors, this is
the force which can be exerted against external objects. It is general
to different muscle groups (hand, arm, back, shoulder, legs) and to
different kinds of tasks. Body mass variables are positively corre-
lated with performance on this factor.

Factor III contains tests included to emphasize Explosive
Strength.³ This factor emphasizes the ability to expend a maximum
of energy in one or a series of explosive acts. This factor is distinguished
from other strength factors in requiring mobilization of
energy for a burst of effort, rather than a continuous strain, stress,
or repeated exertion of muscles. This apparently is the main factor
accounting for individual differences in dashes. Presumably speed is
dependent on effective mobilization of force against the ground in
propelling one’s self forward.

It is also of interest to note that certain dynamic strength tests
have secondary loadings on this factor. These turn out to be the
versions of tests given under time-limit conditions. Thus, instructing
subjects to perform “as rapidly as possible” is more likely to


³ Elsewhere (Fleishman, Kremer, & Shoup, 1961) we have conceptualized the
three main Strength factors in terms of a physical model applied to muscle
systems. Briefly, a system of muscles may be crudely compared with a gasoline
engine which takes in oxygen and hydrocarbons, combines them with a
release of energy and exhausts carbon dioxide and water. Both systems convert
chemical into mechanical energy. The three strength factors, Static Strength,
Explosive Strength, and Dynamic Strength, can be related to the physical
parameters of Force, Energy, and Power, respectively. The interested reader
can pursue our discussion of this in the more detailed report.




bring into play this secondary factor than the “endurance” (to 
limit) versions of these tests. Endurance-limit tests are thus more 
pure measures of Dynamic Strength. 

Factor IV is labeled Trunk Strength since it is confined to three 
tests emphasizing the strength of trunk muscles, particularly ab- 
dominal muscles. 

Factor V is a narrow factor restricted to just those tests involving
manipulations of weights. It is unlikely that this represents an 
important strength factor and for the present it is labeled Weight 
Balance. 

Factors VI and VII are confined to “Athletic Experience” vari- 
ables, representing two patterns of sports participation. 


Analysis of Speed, Flexibility, Balance, and Coordination Tests 


Testing for this study was accomplished at the U.S. Naval Train- 
ing Center, San Diego, California. Test administration was carried 
out by four professional physical educators assisted by 20 Navy 
petty officers. The 30 tests in this battery were administered to 204 
Naval recruits, following the same principles described above with 
the Great Lakes study. It was possible to process a company of 60 
boys in 2 hours. The average age of the subjects was 18 yrs 6 mos 
(S.D. = 1 yr 5 mos); average weight was 153.7 lbs (S.D. = 18); 
average height was 5'9½" (S.D. = 2.4").

Aside from the general objectives stated previously, test varia- 
tions were introduced, a) to discover if a “coordination” factor 
emerges common to the more complex tests, b) to see if there is a 
Speed factor general to all speeded tests, c) to see if flexibility and 
Speed factors correspond to limbs or to specific muscle groups, and
d) to see if balance with eyes open or closed introduced different 
factors. The rationale for test inclusion will become clear in the 
following descriptions of tests and hypothesized factors. 


Brief Test Descriptions

Hypothesized Factor—Extent Flexibility. (Ability to extend or
stretch body.)
1. Abdominal Stretch. From standing position bend as far back as
possible while hips are strapped to fence.
2. Toe Touching. While standing on bench, bend as far forward as
possible, with knees locked.




3. Twist and Touch. With arm outstretched, twist as far around 
as possible, touching scale on wall. 

Hypothesized Factor—Dynamic Flexibility. (Ability to make 
rapid, repeated, muscle-flexing movements.) 

4. Squat, Twist, and Touch. Requires subject to go through as
many cycles of twist, touch, squat, touch movements (in alternate
directions) as possible in 30 sec.

5. Bend, Twist, and Touch. Requires speed of flexing, extending,
and rotating spine.

6. Lateral Bend. Requires speed of flexing trunk muscles laterally.

Hypothesized Factor—Speed of Arm Movement.

7. Plate Tapping. Speed with which subject can horizontally
abduct and adduct arm in 20 seconds (tapping two plates 2 feet
apart).

8. Arm Circling. Speed with which subject can circumduct arm
(circular movements around waste basket).

9. Block Transfer. Speed of flexing and extending elbow (transfers
12 one inch cubes, between trays 6 inches apart).

Hypothesized Factor—Speed of Leg Movement.

10. One Foot Tapping. Speed of horizontally abducting and adducting
leg (lift and return foot over a 4 inch partition to tap a
board).

11. Two Foot Tapping. Speed of flexing and extending hip joint (two
taps at a time made with alternate feet, on a 12 inch kick board
18 inches above floor).

12. Leg Circling. Speed of circumducting legs (while braced, swing
leg around waste basket as many times as possible in 15 sec.).

Hypothesized Factor—Speed of Change of Direction. (Ability to
change direction of body movement.)

13. Dodge Run. (Lateral change.) Speed of running around six
chairs placed in a pattern.

14. Shuttle Run. (Abrupt reversal of movement.) Runs between
parallel lines (15 feet apart) making five round trips.

15. Circle Run. (Continuous change in body movement.) Speed of
running around a 12 foot circle 5 times.

Hypothesized Factor—Coordination. (Ability to perform a number
of body movements simultaneously.)

16. Figure 8 Duck. Runs in a pattern requiring ducking and
straightening of body in motion.




17. Grass Drill. Run on all fours around pattern of chairs. 
18. Soccer Dribble. Dribbles soccer ball with feet around chairs set 
in a pattern. 
Hypothesized Factor—Static Balance. (Maintenance of body 
equilibrium.) 
19. One Foot Lengthwise Balance—Eyes Open. Balances with 
foot parallel to rail, hands on hips. Score is time balanced. 
20. One Foot Lengthwise Balance—Eyes Closed. Same as 19, with 
eyes closed. 
21. One Foot Cross Balance—Eyes Open. Ball of foot perpendicular 
to rail. 
22. One Foot Cross Balance—Eyes Closed. 
23. Two Feet Lengthwise Balance—Eyes Open. Same as 19, but two 
feet are in contact with rail. 
24. Two Feet Lengthwise Balance—Eyes Closed. 
25. Two Feet Cross Balance—Eyes Open. 
26. Two Feet Cross Balance—Eyes Closed. 
Hypothesized Factor—Performance Balance. (Ability to main- 
tain body balance while in motion.) 
27. Rail Walking. Walk backwards around rail hexagon. (Score is
segments traversed.)
28. Board Balance. Maintain balance on movable support (a teeter-totter).
Hypothesized Factor—Balancing Objects. (Ability to balance external
objects with hands or fingers.)
29. Stick Balance. Time a stick pointer is kept balanced on index
finger.
30. Ball Balance. Time volley ball is balanced on back of closed
fist, when arm is held out at shoulder height.


Results 


A factor analysis of the intercorrelations of these 30 tests was carried
out as outlined in the previous study. Again objective analytical
computer rotations were used. Table 3 presents the rotated factor
matrix.


Factor Interpretations 


Factor I is defined by speed tests involving running or gross body
propulsion. Tests originally included to measure “Speed of Change




TABLE 3 
Rotated Factor Loadings (San Diego Study)* 


Factors**

Tests    I    II    III    IV    V    VI    h²
         ES   GBE   DF     B-VC  EF   SLM
1. Abdominal Stretch ПОО 001 .01 55 15 8 
2. Toe Touch 28 06 12 02 39 12 26 
3. Twist and Touch —03 —07 08 09 49 —10 27 
4. Squat, Twist, and Touch INO Ба. —10 11. 17 
5. Bend, Twist, and Touch ОТОРИ во "02 10 21523 
6. Lateral Bend от 06 58 00 21 —07 40 
7. Plate Tapping 39 16 23 —04 —30 44 82 
8. Arm Circling 52 04 18 -09 —01 39 46 
9. Block Transfer 15 (01 56 13 —20 09 40 
10. One Foot Tapping 16 038 58 17 14 09 42 
11. Two Feet Tapping 11 05 19 19 20 46 3 
12. Leg Circling 297. 03 (48 26 —04 05 3837 
13. Dodge Run 69 03 05 03 —11 04 4 
14. Shuttle Run 63 08 19 01 09 21 & 
15. Circle Run 59 06 24 11 ов —18 4 
16. Figure-8 Duck 68 04 21 12 07 08 3 
17. Grass Drill 62 06 16 12 15 -08 46m 
18. Soccer Dribble 20 04 32 17 —07 -01 2 
19. 1-Ft. Lngth. Bal. Eyes Op. BOT ib éd о 0774 
20. 1-Ft. Lngth. Bal. Eyes Cl. —06 72 -03 -06 08 04 & 
21. 1-Ft. Cross Bal. Eyes Op. 10 38 04 55 -07 13 80 
22. 1-Ft. Cross Bal. Eyes Cl. 07 54 2 12 —o5 17.89 
23. 2-Ft, Cross Bal. Eyes Op. 08 53 00 32 —06 06 40 | 
24. 2-Ft. Cross Bal. Eyes Cl. 05 64 -6 01 —o2 12 2M 
25. 2-Ft. Lngth. Bal. Eyes Op. —16 24 19 26 —01 31 ?8 
26. 2-Ft. Lngth Bal. Eyes Cl. 02 13 15 08 05 33 0 
27. Rail Walking 05 44 2 2 02 —15 3l 
28. Board Balance 08 27 80 29 п 13 7 
29. Stick Balance ез. о. 33... 11. 1 
30. Ball Balance 00 02 -10 22 14 4 9* 


* Rounded to two places and decimals omitted.
** The factors are identified as follows: I, Explosive Strength; II, Gross Body Equilibrium;
III, Dynamic Flexibility; IV, Balance—Visual Cues; V, Extent Flexibility; VI, Speed of Limb
Movement.


of Direction” fall on this factor, but there are other tests as well.
Other speeded tests (e.g., Bend, Twist, and Touch) are not on this
factor, so this is not a general speed factor. This factor is interpreted
as the “Explosive Strength” factor, described under the strength
study, above.

Factor II is defined by the balance tests involving maintenance of 
body equilibrium. “Performance balance” tests did not load on this 
factor. The factor is general to “eyes-open” and “eyes-closed” 
balance tests, but is best measured when the eyes are kept closed. 
The factor is labeled “Gross Body Equilibrium.” 




Factor III is defined by tasks originally designed to measure 
Dynamic Flexibility. There is an absence of “extent flexibility” 
tests. Tests on this factor emphasize both speed and flexibility of 
repeated trunk and/or limb movements. 

Factor IV is defined only by the balance tests given with the eyes 
open. The indication is that such tests involve an additional ability 
which emphasizes the use of visual cues in maintaining balance. 
We call this simply Balance—visual cues. 

Factor V is confined to those tasks included to define Extent Flex- 
ibility. The tests require stretching of the trunk and back muscles as 
far as possible, without speed, either laterally, forward, or back- 
ward. 

Factor VI is defined by three of the tests included to measure 
“Speed of Arm” or “Speed of Leg Movement.” The explanation of 
loadings of the other tests may be in the rapid adjustive arm or leg 
movements required. In any case, this factor is renamed “Speed of 
Limb Movement.” 


Conclusions 


Ten primary factors were identified to account for performance 
on the 60 performance tests. Table 4 summarizes the seven most 
general factors as well as recommendations for tests providing the 
best measures of these factors. The tests are recommended on the 
basis of factor loadings, reliabilities, and ease of administration. 
While some tests have familiar names, their administration proce- 
dures may have been changed to emphasize certain factors and the 
interested reader should consult the more detailed reports. Many 
familiar tests were found inadequate or overlapping with tests often 
given in the same battery. For example, Shuttle Run and 50 Yard 
Dash correlate .80 with each other, and Broad Jump correlates approximately
.70 with both. The most widely used battery today includes
all three tests, and clearly little new information about a
student’s proficiency is added by two of the tests. In some cases a
new test turns out to be a much better measure than a traditional
one (e.g., Leg Lifts over Sit-Ups as a measure of trunk strength).
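The overlap among these three tests can be made concrete. The Python sketch below squares the quoted correlations to obtain shared variance and applies the standard two-predictor formula for the squared multiple correlation. It treats the approximate values .80 and .70 as exact, which is an assumption made only for illustration.

```python
def shared_variance(r):
    """Proportion of variance two tests share, given their correlation r."""
    return r * r

def multiple_r_squared(r_y1, r_y2, r_12):
    """Squared multiple correlation of y with two predictors.

    r_y1, r_y2: correlations of y with each predictor;
    r_12: correlation between the two predictors.
    """
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# Correlations quoted in the text, taken here as exact.
R_SHUTTLE_DASH = 0.80
R_JUMP_SHUTTLE = 0.70
R_JUMP_DASH = 0.70

shuttle_from_others = multiple_r_squared(R_SHUTTLE_DASH, R_JUMP_SHUTTLE, R_JUMP_DASH)
jump_from_others = multiple_r_squared(R_JUMP_SHUTTLE, R_JUMP_DASH, R_SHUTTLE_DASH)

print(f"Shuttle Run / 50 Yard Dash shared variance: {shared_variance(R_SHUTTLE_DASH):.2f}")
print(f"Shuttle Run predicted from the other two:   {shuttle_from_others:.2f}")
print(f"Broad Jump predicted from the other two:    {jump_from_others:.2f}")
```

Under these assumed values, roughly two-thirds of the Shuttle Run variance is predictable from the other two tests, which is the sense in which the extra tests add little new information.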

There are, of course, still some unanswered questions. One of the 
most intriguing concerns the nature of “coordination” and agility. 
We had tentatively identified such an ability in earlier work (Hempel
& Fleishman, 1955) but were unable to confirm it here. A concerted
effort needs to be made to see if these are usefully considered




TABLE 4
Reliabilities and Factor Loadings of Recommended Basic Fitness Tests

Test                          Primary Factor Measured                   Reliability   Primary Factor Loading   Other Factor Loading
1. Twist and Touch*           Extent Flexibility                        .90           .49                      —
2. Bend, Twist, and Touch**   Dynamic Flexibility                       .92           .50                      —
3. Shuttle Run                Explosive Strength                        .85           .77                      .39 (DS)
4. Softball Throw             Explosive Strength                        .93           .66                      .32 (SS)
5. Hand Grip                  Static Strength                           .91           .72                      —
6. Pull-Ups                   Dynamic Strength                          .93           .81                      —
7. Leg Lifts                  Trunk Strength                            .89           .47                      .82 (DS)
8. Balance A***               Gross Body Equilibrium                    .82           .72                      —
9. Cable Jump                 Gross Body Coordination                   .70           .56                      —
10. 600 Yd. Run-Walk          Stamina (Cardio-Vascular Endurance)†      .80           —                        —

* Later called Extent Flexibility Test.
** Later called Dynamic Flexibility Test.
*** Originally called One Foot Lengthwise Balance—Eyes Closed.
† Factor hypothesized; represents interim coverage.


“separate” abilities. In the meantime, Table 4 includes the test
found to measure this factor in this previous study. The area of
cardio-vascular endurance (stamina) needs to be explored; it was
not possible to include tests of prolonged exertion in the present
batteries. A commonly used measure of such a hypothesized factor is
included in Table 4. The 10 tests in Table 4 would seem to offer a
battery of Basic Fitness Tests of 9 proficiency factors, based on
current evidence.

As a follow up to this work, a national study involving 20,000
students in 45 U.S. cities has been completed and reported elsewhere
(Fleishman, 1962). This study provided test norms and standards
for a battery of Basic Fitness Tests, and provided developmental
curves, plotted for ages 12 to 18, on the component physical fitness
measures.


REFERENCES 


Fleishman, Edwin A. The Dimensions of Physical Fitness—The
Nationwide Normative and Developmental Study of the Basic
Tests. Office of Naval Research, Contract Nonr. 609 (32), Technical
Report 4, Yale University, August, 1962.

Fleishman, Edwin A., Kremer, Elmar J., and Shoup, Guy W. The
Dimensions of Physical Fitness—A Factor Analysis of Strength
Tests. Office of Naval Research, Contract Nonr. 609 (32), Technical
Report 2, Yale University, August, 1961.

Fleishman, Edwin A., Thomas, Paul, and Munroe, Philip. The Dimensions
of Physical Fitness—A Factor Analysis of Speed, Flexibility,
Balance, and Coordination Tests. Office of Naval Research,
Contract Nonr. 609 (32), Technical Report 3, Yale
University, September, 1961.

Hempel, Walter E. and Fleishman, Edwin A. “A Factor Analysis of
Physical Proficiency and Fine Manipulative Skill.” Journal of
Applied Psychology, XXXIX (1955), 12-16.

Nicks, Delmer C. and Fleishman, Edwin A. What Do Physical Fitness
Tests Measure?—A Review of Factor Analytic Studies.
Office of Naval Research, Contract Nonr. 609 (32), Technical
Report 1, Yale University, July, 1960. (Also summarized in
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, XXII (1962).)



EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 4, 1963


FORMULA SCORING AND VALIDITY 


FREDERIC M. LORD 
Educational Testing Service 


By formula score is meant here the score obtained by subtracting
from the number-right score a fixed proportion of the number
wrong. The present paper attempts a limited mathematical approach
to the question: “What is the sign and general magnitude of the
difference in validity to be expected between the number-right score
and a formula score on the same test?” A subsidiary question that
is also considered concerns the amount of decrement in validity
resulting from random guessing.

In the present case, the theoretical approach is not a satisfactory
substitute for carrying out a large number of comparative validity
studies in practical situations. On the other hand, the changes in
validity produced by formula scoring are usually so small as to 
make it difficult to demonstrate empirically their statistical sig- 
nificance unless very large numbers of cases are available. 

Actually, the main arguments for and against formula scoring
probably are not to be found in group statistics at all, but rather
in the undesirable effects of one kind of scoring or the other for 
certain individuals. Nevertheless, it may be of interest to investigate 
here just how much difference in group statistics is produced by 
formula scoring in certain restricted situations. 

The first section presents some background material. The second 
and third sections outline the assumptions made and present the 
formulas derived. The next section presents some representative 
numerical results computed from the formulas and discusses their 
implications. The final section details the derivations of the formulas.






For convenience, the discussion will use the terminology appropriate
for aptitude and achievement testing, although the reasoning
is clearly applicable in certain other types of testing situations.


Preliminary Considerations


Three points commonly made about formula scoring will now be
examined briefly.

1. Formula scoring corrects only for random guessing, but examinees
seldom guess purely at random; their responses are determined
by partial information, by misinformation, and by relevant
and irrelevant cues. Thus the examinee's actual state of knowledge
with respect to the test questions cannot be inferred, either logically
or statistically, from his responses to the test questions. This conclusion
is clear when it is realized that there are only three objectively
distinguishable kinds of responses and that each may arise
from entirely different levels of knowledge. Right answers may arise
from complete knowledge, from partial knowledge, or from guessing;
wrong answers may arise simply from guessing or from variously
undesirable degrees of misinformation; omitted answers may
represent ignorance or various degrees of knowledge, up to and including
complete knowledge in the case of partially speeded tests.
Clearly the amount of the examinee’s “knowledge” cannot be estimated
from his marks on the answer sheet without the use of
further assumptions.

2. Formula scores correlate very highly with rights scores on the
same test. This empirically determined fact is not sufficient reason
to dismiss formula scoring from further consideration, as some
appear to believe. Suppose that the correlations between formula
score ($u$), rights score ($x$), and some outside criterion ($c$) are given
by the matrix

\[
\begin{bmatrix}
1 & r_{ux} & r_{uc} \\
r_{ux} & 1 & r_{xc} \\
r_{uc} & r_{xc} & 1
\end{bmatrix}.
\]

Any set of values for the $r$'s may occur in practice provided only
that the determinant of the matrix be nonnegative, i.e., provided
that

\[
1 - r_{ux}^2 - r_{uc}^2 - r_{xc}^2 + 2 r_{ux} r_{uc} r_{xc} \ge 0. \tag{1}
\]

This inequality is the same as the following:

\[
r_{ux} r_{xc} - \sqrt{r_{ux}^2 r_{xc}^2 - r_{ux}^2 - r_{xc}^2 + 1}
\;\le\; r_{uc} \;\le\;
r_{ux} r_{xc} + \sqrt{r_{ux}^2 r_{xc}^2 - r_{ux}^2 - r_{xc}^2 + 1}. \tag{2}
\]

For example, if $r_{ux} = .99$ and $r_{xc} = .60$, we find from (2) that
$.48 \le r_{uc} \le .71$. Thus a high correlation between $u$ and $x$ does not
prove that one measure is about as good as the other.

3. According to Guilford (1954), the decrease in validity when
one abandons formula scoring is in practice likely to be of the order
of .00 to .03. From many points of view .03 is only a small reduction
in a validity coefficient. Consider, however, the standard formula
(Gulliksen, 1950, ch. 9, eq. 9)

\[
K = \frac{r_{Kc}^2 \,(1 - r_{xx})}{r_{xc}^2 - r_{xx}\, r_{Kc}^2}, \tag{3}
\]

where $r_{xx}$ and $r_{xc}$ are the reliability and validity of a given test,
$r_{Kc}$ is the validity obtained on a shortened form of the same test,
and $K$ is the factor by which the test length must be decreased to
secure a validity of $r_{Kc}$. Suppose now, for example, that for some
formula-scored test $r_{xx} = .90$, $r_{xc} = .60$, and that a reduction in
validity from .60 to .57 would result from abandoning formula scoring.
By (3), this reduction in validity is the same as would occur if
the test length were halved ($K = 0.48$)! In this case, failure to use
formula scoring produces a decrement in validity equivalent to
throwing away one-half of the test items and one-half of the testing
time, or equivalent to ignoring one-half of each examinee's responses.
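The arithmetic behind this example is easy to reproduce. The Python sketch below evaluates the length factor for the values quoted in the text and then inverts it with the usual expression for the validity of a test whose length is changed by a factor K. The function names are illustrative and the inverse formula is the standard one, assumed here rather than quoted from the paper.

```python
import math

def length_factor(r_xx, r_xc, r_target):
    """Factor K by which test length must change to reach validity r_target."""
    return r_target**2 * (1 - r_xx) / (r_xc**2 - r_xx * r_target**2)

def validity_at_length(r_xx, r_xc, k):
    """Validity of a test lengthened (k > 1) or shortened (k < 1) by factor k."""
    return r_xc * math.sqrt(k) / math.sqrt(1 + (k - 1) * r_xx)

k = length_factor(r_xx=0.90, r_xc=0.60, r_target=0.57)
print(f"K = {k:.2f}")
print(f"validity at that length = {validity_at_length(0.90, 0.60, k):.2f}")
```

Because the denominator shrinks toward zero as the target validity rises, small gains in validity can correspond to large changes in test length, which is the point of the example.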

Let us now proceed to set up a mathematical model that will pro- 
vide an indication of the increments or decrements in validity to be 
expected in certain limited types of situations, with the hope that 
these will not be too atypical of those to be found in other prac- 
tical situations. 

Assumptions 

The assumptions made are as follows.

1. The number of right answers, $y_a$, that examinee $a$ actually gives
on an $n$-item test is composed of two additive components:

$x_a$, the number of right answers that would be given if guessing
were held to a minimum;

$g_a$, the number of right answers obtained (on the remaining
$n - x_a$ items) by guessing.

The kind of guessing to be considered is limited by the assumptions
that follow.

2. It is assumed that $p$, the probability of success by guessing on
any given item, is the same for any examinee and for any item.
(This assumption does not hold well for most tests; it should hold
approximately, however, for some mathematical and other types of
test items where the right answer can ordinarily be obtained only
by working it out or by sheer guessing.)

3. The present derivation limits itself to the case where $P_a$, the
propensity of examinee $a$ to guess under any given set of test directions,
i.e., the proportion of the $n - x_a$ items on which he will choose
to guess, is uncorrelated with and totally independent of $x_a$ and
$c_a$, the criterion score against which the test is to be validated. The
lack of correlation between $P_a$ and $x_a$ is in accord with the empirical
findings of Swineford (1938, 1941).

It is seen that for a given examinee under a given set of test
directions, $g_a$ is a binomial variable with expected value $pP_a(n - x_a)$.
If the random error arising from guessing is denoted by
$e_a = g_a - pP_a(n - x_a)$, then we have

\[
\begin{aligned}
y_a &= x_a + g_a \\
    &= x_a + pP_a(n - x_a) + e_a \\
    &= x_a + npP_a - pP_a x_a + e_a.
\end{aligned} \tag{4}
\]

The number of omitted items on the test under this mathematical
model may be denoted by

\[
u_a = Q_a(n - x_a), \tag{5}
\]

where $Q_a = 1 - P_a$.

With the help of (4) and (5) we can obtain formulas for the
validity of the rights score, $y_a$, and for the validity of the corresponding
formula score. An important point is that the assumptions
have been so chosen that these formulas will depend upon a set
of test parameters for which it is possible to obtain armchair estimates
that are reasonable, or that at least are self-consistent.
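The model in (4) and (5) can be simulated directly. The Python sketch below generates rights, wrongs, and omits for a group of hypothetical examinees and compares the conventional formula score (rights minus p/(1 - p) times wrongs) with the rights-plus-fraction-of-omits score. The test length, the normal distribution for the known items, and the uniform distribution for the propensity to guess are all assumptions made for illustration; the paper specifies none of them.

```python
import random

N_ITEMS = 98      # assumed test length (illustrative)
P_CHANCE = 0.2    # p: chance of success on a guessed item (5-choice items)

def simulate_group(n_examinees=2000, seed=1):
    """Return a list of (x, y, w, u) tuples, one per simulated examinee."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_examinees):
        # x: items answered correctly without guessing (assumed roughly normal).
        x = min(N_ITEMS, max(0, round(rng.gauss(58, 14))))
        # Propensity to guess, drawn independently of x (assumed uniform).
        propensity = rng.uniform(0.2, 0.8)
        guessed = round(propensity * (N_ITEMS - x))
        # g: lucky guesses, binomial with `guessed` trials and success rate p.
        g = sum(rng.random() < P_CHANCE for _ in range(guessed))
        y = x + g                    # rights score
        w = guessed - g              # wrong answers
        u = N_ITEMS - x - guessed    # omitted items
        records.append((x, y, w, u))
    return records

def formula_score(y, w, p=P_CHANCE):
    """Conventional correction: rights minus p/(1 - p) times wrongs."""
    return y - p * w / (1 - p)

def rights_plus_omits(y, u, p=P_CHANCE):
    """Rights score plus the chance fraction of the omits."""
    return y + p * u

records = simulate_group()
x, y, w, u = records[0]
print(x, y, w, u, formula_score(y, w), rights_plus_omits(y, u))
```

Since rights, wrongs, and omits always sum to the number of items, the two scores are exact linear functions of one another for every simulated examinee, so they must correlate perfectly.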


Formulas for Validity 
The validity coefficient for the rights score will be shown to be

\[
r_{cy} = \frac{(1 - p\bar P)\, s_x}{s_y}\, r_{cx}, \tag{6}
\]

where $\bar P$ is the mean of the $P_a$ in the group tested, $s_x$ is the standard
deviation of the $x_a$, $r_{cx}$ is the validity coefficient of the $x_a$, and $s_y$ is
the standard deviation of the $y_a$, computed from

\[
s_y^2 = (1 - p\bar P)^2 s_x^2 + p^2 (n - \bar x)^2 s_P^2 + p^2 s_x^2 s_P^2 + pq\bar P(n - \bar x), \tag{7}
\]

with $q = 1 - p$.

If $w_a$ is the number of wrong answers given by examinee $a$, the
usual formula score would be $y_a - p w_a/(1 - p)$. It is well known
that this is perfectly correlated with the score, here denoted by $v_a$,
obtained by adding an appropriate fraction of the omits ($u_a$) to the
number right:

\[
\begin{aligned}
v_a &= y_a + p u_a \\
    &= x_a + p(n - x_a) + e_a \\
    &= np + q x_a + e_a.
\end{aligned} \tag{8}
\]

The perfect correlation results from the fact that, for every examinee,

\[
y_a + w_a + u_a = n. \tag{9}
\]

Thus the validity of the usual formula score is the same as $r_{cv}$,
which will be shown to be

\[
r_{cv} = \frac{q\, s_x}{s_v}\, r_{cx}, \tag{10}
\]

where

\[
s_v^2 = q^2 s_x^2 + pq\bar P(n - \bar x). \tag{11}
\]

From (6) and (10) it is seen that

\[
\frac{r_{cv}}{r_{cy}} = \frac{q\, s_y}{(1 - p\bar P)\, s_v}. \tag{12}
\]

It may be readily verified, further, that if $\bar P = 1$ and $s_P^2 = 0$,
then $s_y = s_v$ and $r_{cy} = r_{cv}$. This corresponds to the well known fact
that when everyone answers all test items, the correlation between
rights score and formula score is perfect.
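The two validity ratios that follow from these formulas can be evaluated for any set of test parameters. The Python sketch below codes the variances of the rights score and of the formula-type score, then the ratios of validities. The expressions are as reconstructed here from the derivation section, and the parameter values (98 items, mean 58, variance 200, five-choice items, mean propensity .5, propensity variance .05) are illustrative assumptions; they give ratios close to the .98 and 1.008 quoted for test 1 in the discussion.

```python
import math

def variance_rights(n, x_bar, s_x2, p, P_bar, s_P2):
    """Variance of the rights score y under the guessing model."""
    q = 1 - p
    return ((1 - p * P_bar) ** 2 * s_x2
            + p ** 2 * (n - x_bar) ** 2 * s_P2
            + p ** 2 * s_x2 * s_P2
            + p * q * P_bar * (n - x_bar))

def variance_formula(n, x_bar, s_x2, p, P_bar):
    """Variance of v, the rights score plus p times the omits."""
    q = 1 - p
    return q ** 2 * s_x2 + p * q * P_bar * (n - x_bar)

def guessing_decrement(n, x_bar, s_x2, p, P_bar, s_P2):
    """Ratio of rights-score validity to the validity of x."""
    s_y = math.sqrt(variance_rights(n, x_bar, s_x2, p, P_bar, s_P2))
    return (1 - p * P_bar) * math.sqrt(s_x2) / s_y

def formula_increment(n, x_bar, s_x2, p, P_bar, s_P2):
    """Ratio of formula-score validity to rights-score validity."""
    s_y = math.sqrt(variance_rights(n, x_bar, s_x2, p, P_bar, s_P2))
    s_v = math.sqrt(variance_formula(n, x_bar, s_x2, p, P_bar))
    return (1 - p) * s_y / ((1 - p * P_bar) * s_v)

# Illustrative parameter values (assumed, see lead-in).
example = dict(n=98, x_bar=58, s_x2=200, p=0.2, P_bar=0.5, s_P2=0.05)
decrement = guessing_decrement(**example)
increment = formula_increment(**example)
print(f"decrement due to guessing: {decrement:.3f}")
print(f"increment from formula scoring: {increment:.3f}")
```

Setting the mean propensity to 1 and its variance to 0 makes the two variances equal, so the increment ratio becomes 1, as the text notes for the case where every examinee answers every item.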


Numerical Illustrations and Discussion 


The decrement caused by guessing can be summarized by the
ratio $r_{cy}/r_{cx}$; the increment achieved by formula scoring, by the
ratio $r_{cv}/r_{cy}$. These two ratios, computed from formulas (6) and




(12), are given in the antepenultimate and penultimate columns of
Table 1 for a number of hypothetical tests. The last column gives
the factor by which a test of reliability $r_{xx}$ would have to be lengthened
in order to achieve the increment in validity shown in the
preceding column.

The parameters necessary to describe each test are given in the
first seven columns in the body of the table. The choice of numerical
values requires some discussion. The values of $p = 0.20$, 0.25, and
0.5 are chosen to represent approximately the situation with certain
types of 5-choice, 4-choice, and 2-choice tests, respectively. The
reliability of the $x$-scores is not needed for the computations, but
it is closely related to the variance, $s_x^2$, so a reasonable value of $r_{xx}$
has been estimated from $s_x^2$ by the author's formula (1959, p. 237)
and listed in the table to guide the reader. The value $\bar P = 0.5$ indicates
that on the average examinees in the group guess on half of
the $n - x_a$ items. When $\bar P = 0.5$, the variance of the propensity to
guess is believed typically to be more than the listed value of
$s_P^2 = 0.02$ and less than the listed value of $s_P^2 = 0.05$. The case
$\bar P = 1$, $s_P^2 = 0$ is the case when all examinees answer every item.

Most of the hypothetical tests in the table differ from test 1 in
just one respect. A comparison of test 1 and test 2 illustrates the
effect of reducing $s_P^2$. When $s_P^2$ is .05 the validity of the $y$-scores
is only .98 times the validity of the $x$-scores; formula scoring raises
the validity to 1.008 times that of the $y$-scores. This last increment
is equivalent to that obtained by lengthening a test of reliability .91
by twenty percent. When $s_P^2 = .02$, the decrement due to guessing is
less, and the gain achieved by formula scoring is consequently
much less.

A comparison of test 1 and test 3 illustrates the well-known fact
that when every examinee answers every item, the formula score
($v$) is perfectly correlated with the rights score ($y$) so that $r_{cv}/r_{cy} = 1$.
The value of $r_{cy}/r_{cx} = r_{cv}/r_{cx} = .976$ illustrates the decrement
due to guessing when everyone guesses.

A comparison of tests 1, 4, and 5 illustrates the effect of going
from 5-choice to 4-choice to 2-choice items. The decrements in
validity due to guessing are respectively .98, .97, and .90; these correspond
to shortening the test to .68, .59, and .27 of its original
length, respectively. Formula scoring is seen to be more important
for 4-choice tests than for 5-choice tests. For the 2-choice tests

TABLE 1
Changes in Validity for Selected Tests

Test   n    x̄    p     r_xx   s_x²   P̄     s_P²   r_cy/r_cx   r_cv/r_cy   Length factor
1      98   58   .2    .91    200    .5    .05    .980        1.008       1.2
2      98   58   .2    .91    200    .5    .02    .986        1.002       1.0
3      98   58   .2    .91    200    1.0   0      .976        1.000       1.0
4      98   58   .25   .91    200    .5    .05    .971        1.013       1.4
5      98   58   .5    .91    200    .5    .05    .896        1.064       [illegible]
6      98   58   .5    .91    200    1.0   0      .913        1.000       1.0
7      49   29   .2    .91    100    .5    .05    .984        1.003       1.1
8      49   29   .2    .835   50     .5    .05    .971        1.006       [illegible]
9      98   29   .2    .91    200    .5    .05    .956        1.025       [illegible]

Column headings: n, number of items; x̄, mean score (before guessing); p, probability of chance
success; r_xx, reliability (before guessing); s_x², variance (before guessing); P̄, average propensity
to guess; s_P², variance of propensity; r_cy/r_cx, decrement due to guessing; r_cv/r_cy, increment
from formula scoring; length factor, equivalent increase in test length.




investigated, formula scoring produces a gain in validity larger than
could possibly be obtained by extending the test to any length 
whatever. 

Test 6 illustrates the situation when there are no omits on a 
true-false test. Test 7 illustrates what happens when we shorten a 
test but keep its reliability unchanged. Test 8 illustrates the case 
where the test is shortened and the reliability allowed to decline 
accordingly. Test 9 shows what happens when the test becomes quite 
difficult for the group tested. 

The tentative conclusion seems to be that for 5-choice items, 
formula scoring may produce an increment in validity equivalent to 
that obtained by lengthening the test by twenty percent or more. In 
most cases, however, the increment will probably be less than this. 
Formula scoring seems to be important (1) when there is wide vari- 
ability among examinees in propensity to guess, (2) when there are 
fewer than five choices per item, (3) when the test is quite difficult 
for the group tested. 


Derivations of Formulas 


The formulas in the preceding section are readily derived from a
few well-known basic formulas. The first is the general formula for
the covariance ($S$) between two weighted sums:

\[
S_{(\sum_a w_a X_a)(\sum_b W_b Y_b)} = \sum_{a=1}^{N} \sum_{b=1}^{M} w_a W_b S_{ab}, \tag{13}
\]

where $X_1, X_2, \ldots, X_a, \ldots, X_N$ is any set of $N$ random variables,
$Y_1, Y_2, \ldots, Y_b, \ldots, Y_M$ is any set of $M$ random variables, the $w_a$ and
$W_b$ are the weights, and $S_{ab}$ is the covariance between $X_a$ and $Y_b$.
The formula for the variance of any weighted sum is readily
written down from (13):

\[
s^2_{\sum_a w_a X_a} = \sum_{a=1}^{N} \sum_{b=1}^{N} w_a w_b S_{ab}, \tag{14}
\]

where $S_{ab}$ is now the covariance between $X_a$ and $X_b$, and $S_{aa}$ is the variance of $X_a$.
The second basic formula states that if the variables $X$ and $Y$ are
independent, then any function of $X$ and any function of $Y$ will
be uncorrelated. In particular, if $X$ and $Y$ are independent,

\[
\frac{1}{N} \sum f(X)\, h(Y) = \left[\frac{1}{N} \sum f(X)\right] \left[\frac{1}{N} \sum h(Y)\right] \tag{15}
\]

for arbitrary $f$ and $h$.


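Since the mean of $e$ is zero, it is readily seen from (4) that

\[
\bar y = \bar x + np\bar P - \frac{p}{N} \sum_a P_a x_a.
\]

Since $P$ and $x$ are uncorrelated,

\[
\frac{1}{N} \sum_a P_a x_a = \bar P \bar x, \tag{16}
\]

\[
\bar y = \bar x + p\bar P(n - \bar x). \tag{17}
\]

From (4) and (14), the variance of $y$ is

\[
s_y^2 = s_x^2 + n^2 p^2 s_P^2 + p^2 s_{(Px)}^2 + s_e^2
        + 2np\,S_{xP} - 2p\,S_{x(Px)} - 2np^2 S_{P(Px)}, \tag{18}
\]

where $s_{(Px)}^2$ is the variance over people of the product $P_a x_a$, $S_{x(Px)}$
is the covariance of $x_a$ with the product $P_a x_a$, and so forth. These
variances and covariances of products are readily expressed in
simpler terms by the use of (15).

Thus,

\[
s_{(Px)}^2 = \frac{1}{N} \sum_a P_a^2 x_a^2 - \left(\frac{1}{N} \sum_a P_a x_a\right)^2. \tag{19}
\]

With the help of (15), (19) becomes

\[
s_{(Px)}^2 = \left(\frac{1}{N} \sum_a P_a^2\right) \left(\frac{1}{N} \sum_a x_a^2\right) - \bar P^2 \bar x^2 \tag{20}
\]

\[
\phantom{s_{(Px)}^2} = s_P^2 s_x^2 + \bar x^2 s_P^2 + \bar P^2 s_x^2. \tag{21}
\]

Similarly, we find

\[
S_{x(Px)} = \bar P s_x^2, \tag{22}
\]

\[
S_{P(Px)} = \bar x s_P^2. \tag{23}
\]

The variance of the errors is obtained from the usual binomial
formula, remembering that the number of “trials” for a given
examinee is in this case $P_a(n - x_a)$. For a given examinee the
error variance is $P_a(n - x_a)pq$, so that over all examinees

\[
s_e^2 = \frac{1}{N} \sum_a P_a(n - x_a)\,pq = pq\bar P(n - \bar x). \tag{24}
\]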



Substitution of equations (21) through (24) into (18) yields the
formula for $s_y^2$ given in (7). Formula (11) for $s_v^2$ is similarly obtained
from (8).
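The product-moment identities used in this substitution can be checked numerically. The Python sketch below builds a fully crossed set of propensity and score values, so that the two variables are exactly independent in the sample, and compares the variance of the product and its two covariances with the simpler expressions. The particular values are arbitrary illustrations.

```python
from itertools import product
from statistics import fmean, pvariance

# Every propensity value is paired with every score value, so P and x
# are exactly independent within this artificial group.
P_VALUES = [0.2, 0.5, 0.8]
X_VALUES = [40.0, 55.0, 70.0, 85.0]
pairs = list(product(P_VALUES, X_VALUES))

P = [pa for pa, _ in pairs]
x = [xa for _, xa in pairs]
Px = [pa * xa for pa, xa in pairs]

def pcov(a, b):
    """Population covariance of two equally long sequences."""
    mean_a, mean_b = fmean(a), fmean(b)
    return fmean([(ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)])

P_bar, x_bar = fmean(P), fmean(x)
s_P2, s_x2 = pvariance(P), pvariance(x)

var_product = pvariance(Px)
var_product_formula = s_P2 * s_x2 + x_bar**2 * s_P2 + P_bar**2 * s_x2
cov_x_product = pcov(x, Px)
cov_P_product = pcov(P, Px)

print(var_product, var_product_formula)
print(cov_x_product, P_bar * s_x2)
print(cov_P_product, x_bar * s_P2)
```

With real data the two variables are only approximately independent, so the identities would then hold only approximately.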

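To derive (6) we only need the covariance $S_{cy}$, which is seen from
(4) and (13) to be

\[
S_{cy} = S_{cx} + np\,S_{cP} - p\,S_{c(Px)}. \tag{25}
\]

The second term on the right vanishes. The last term is

\[
S_{c(Px)} = \frac{1}{N} \sum_a P_a x_a c_a - \bar c \cdot \frac{1}{N} \sum_a P_a x_a. \tag{26}
\]

Since $P_a$ is uncorrelated with the product $c_a x_a$, by (15)

\[
\frac{1}{N} \sum_a P_a x_a c_a = \bar P \cdot \frac{1}{N} \sum_a c_a x_a = \bar P (S_{cx} + \bar c \bar x). \tag{27}
\]

Thus (26) becomes

\[
S_{c(Px)} = \bar P S_{cx}.
\]

Finally,

\[
S_{cy} = (1 - p\bar P)\, S_{cx}. \tag{28}
\]

Since $r_{cy} = S_{cy}/s_c s_y$, (28) yields the result already given in (6).
Equation (10) is verified by a similar but simpler process.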


REFERENCES 
Guilford, J. P. Psychometric Methods. New York: McGraw-Hill
Book Company, 1954.

Gulliksen, H. Theory of Mental Tests. New York: John Wiley &
Sons, 1950.

Lord, F. M. “Tests of the Same Length Do Have the Same Standard
Error of Measurement.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT,
XIX (1959), 233-239.

Swineford, Frances. “The Measurement of a Personality Trait.”
Journal of Educational Psychology, XXIX (1938), 295-300.

Swineford, Frances. “Analysis of a Personality Trait.” Journal of
Educational Psychology, XXXII (1941), 438-444.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 4, 1963


ADAPTING TESTS TO THE CULTURAL SETTING ¹


PAUL A. SCHWARZ 
American Institute for Research 


AFRICA has been the scene of much testing research but few testing
programs. The data in Wickert’s (1960) survey, in the reports
of individual countries (CCTA, 1960), and in Biesheuvel’s (1962) 
questionnaire study show that throughout tropical Africa little ap- 
plied testing is done. 

This can be partly ascribed to the administrative problems of 
introducing and operating large-scale selection or guidance pro- 
cedures. But the basic limitation has been the state of the art. For 
practical application, the extensive research on African testing has 
failed to produce adequate techniques. 

One factor is that the standards set by a few workers have been 
largely ignored by the rest. Although Biesheuvel (1952c) early 
described the conditions to be satisfied by suitable tests, most published
results are based on superficially changed European or American
procedures. Although many writers (e.g., Biesheuvel, 1952b,
1956; Verhaegen, 1956) have explained the impossibilities of “culture-free”
tests and the errors inherent in cross-cultural comparisons,
such instruments and such studies have dominated the design of
much past research. The result has been an accumulation of data 
with no practical implications. 

Those workers who did try to observe cultural limitations, moreover,
have concentrated on apparatus tests. In addition to the extensive
use of such traditional procedures as Kohs’ Cubes (e.g.,
Romier, 1958) or fil de fer (e.g., Morin, 1955), indigenous research
produced the Adaptability Test (Biesheuvel, 1952a), the Seven
Squares Test (Hector, 1959), the Chopsticks and Tweezer-Nozzle
Tests (Dewet, 1957), and numerous other devices. These proved
reasonably effective for low-level screening decisions. Because of
their cost, narrow measurements, and unsuitability for mass administration,
however, apparatus tests could not alone solve the broad
range of African testing problems.

¹ The study on which these findings are based was supported
through Contracts ICAc-1434 and I T ... Institute for Research by
the U. S. Agency for International Development.

The notion of a similar effort in developing effective pencil-and- 
paper procedures was given short shrift (Biesheuvel, 1952c), and, 
ten years ago, this judgment was probably sound. But as primitive 
societies became developing nations, it led to a serious gap. For 
people of little or no education there was a wide choice of available 
apparatus tests. For the elite at or above the secondary-school level, 
imported instruments could provide fair approximations. For applicants
with primary or middle-school training—who by now were
the focus of most technical, vocational, and academic selection— 
nothing suitable had been produced (CCTA, 1961). To enable prac- 
tical testing programs, there was a clear-cut and growing need for 
appropriate pencil-and-paper procedures. 

The first major effort to develop such tests was begun two years 
ago, as part of the United States foreign aid program. This study 
has shown that, although extensive adaptation is required, pencil- 
and-paper devices can be made highly effective for African ex- 
aminees, and has generated a number of principles to guide test 
development in the countries of tropical Africa (Schwarz, 1961). 
Subsequent experience (Schwarz, 1962) and some related findings 
of Fontaine (1961), working with the Arab populations of North 
Africa, suggest that at least the basic approach developed in this 
research would prove generalizable to quite different locations.

This paper describes the methodology of test adaptation that 
evolved from these earlier studies and from more recent findings in 
extending the work to other types of tests and to new populations. 
It suggests a number of specific modifications that have proved 
generally effective, and illustrates the approach to those problems 
that must be solved anew in each different setting. The aim is to 
provide a systematic procedure for the design of research on test 
adaptation. 

The discussion is divided into three sections, representing three
aspects of the standard testing process that must be adapted. Specific
tests and studies here cited as illustrations of technique will be
more fully described in subsequent reports.


The Specific Test Operations 


The construction of tests to solve any selection problem begins
with two basic decisions. The first determines the kinds of abilities 
that will be measured as likely predictors of success in the activity 
for which the candidates are being selected. The second specifies the 
exact operations that the candidates will be asked to perform to 
demonstrate these abilities in the test situation. 

A change in the cultural setting need not affect the first of these 
decisions. Validity studies have confirmed the common assumption 
that the abilities predictive of an activity in one culture are similarly 
predictive in other locations; e.g., that carpentry involves the same 
aptitudes in Nigeria as in the United States (Schwarz, 1961). The 
important adaptations are in the sample of tasks from which certain 
aptitudes will be inferred—in both the number and kinds of test 
operations. 

To appreciate the need for quantitative changes, it is instructive 
to administer a typical American test to a class of African sixth- 
graders. When each step must be explained, practiced, and cor- 
rected, this experience demonstrates the surprising number of skilled 
operations that make up the standard aptitude test. Such tests fail, 
irrespective of content, because they constitute too massive a learning
task for the African examinee.

Yet, analysis will show that only a few of the component opera- 
tions are central to the specific ability being assessed. Most are of 
an auxiliary nature, deriving from a particular test format rather 
than the actual test task. In our much-tested culture, these purely 
mechanical operations can be essentially ignored. Publishers mix 
skills, formats, and rules within the covers of a single reusable test 
booklet; and quite justifiably assume that the abilities required in 
using such streamlined materials will not confound the measure- 
ments sought. But in societies not accustomed to the testing ritual, 
finding the correct answer may be no more of a challenge than finding
the spot where it should be marked (Biesheuvel, 1952c).

Devising a format stripped to essentials is the easiest of the
major modifications. Most tests can be printed on a single sheet of
paper. When more space is required, the several sheets can be distributed
and collected separately to avoid confusion. On objective
tests, the responses can be made to coincide with standard scoring 
stencils, enabling rapid hand- or machine-scoring without separate 
answer sheets. For manual tests, a self-scoring type of paper can be 
used. And, because tests are most often used in combination, a con- 
sistent format can be adopted for similar tests. The materials are 
in all cases consumed, but the cost can be limited to a few cents 
per test (Schwarz, 1961). 

By lightening the over-all load, these modifications eliminate one 
serious obstacle to effective test performance. It still remains, how- 
ever, to insure the adequacy of the measured operations that will 
produce the test score. 

This is the stage at which specific environmental factors must be 
considered. If these tasks are to provide a reasonably stable meas- 
urement of an examinee’s more general performance characteristics, 
it is necessary that they require the application or generalization of 
knowledges and experiences he already possesses—that each task is 
backed by adequate environmental supports. To write the actual 
test exercises, it is further necessary that the nature of these supports
is known. Depending on the particular test, these requirements
may or may not pose a problem. 

A logical starting point for devising tasks appropriate to a given 
measurement is afforded by the American or European tests of the 
same function. When analyzed in terms of environmental supports, 
these standard operations will be seen to require different kinds of 
adaptations, in accordance with the following three situations. 

1. When the environmental supports are adequate but unknown. 
This situation arises when it is reasonable to presume that there 
exist adequate supports to enable a particular form of measure- 
ment, but when the nature of these supports has not been deter- 
mined. Available data suggest only that they are different from 
comparable background factors in more developed cultural settings. 

A specific example is a test of mechanical information. This test 
assumes that a youth with mechanical inclinations will react to the 
mechanical aspects of his environment, and will acquire knowledge 
that can be tested as a measure of these inclinations. It is, in the
United States, an excellent predictor of success in technical fields. 
But, obviously, the African youth has not the background to cope
with the gadget-oriented questions of the standard mechanical test.

In the case of this example, one should be willing to assume that 
the African environment does provide adequate supports. Even in a 
remote village, the people build houses, prepare foods, and fashion 
utensils. The problem is to write items that sample the most widely 
available supports, and to do this the latter must be determined. A 
considerable research effort may be involved. In the AIR study, the 
development of an effective mechanical test took more than one 
year. It was necessary to study village life, find elements common 
to rural and urban settings, and to experiment with trial questions 
in a continuing process of item analysis and revision. But, eventually,
seventy suitable test items were produced (Schwarz, 1962).²

Such extensive data collection should be anticipated for test oper- 
ations in this first category, and is the chief characteristic of the 
appropriate method of adaptation. The problems are in all cases 
soluble, provided that the requisite information is obtained. 
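
The "continuing process of item analysis and revision" mentioned above is not specified in detail. As one common form it might take, the sketch below computes for each trial question the proportion of examinees passing and the correlation of the item with the total on the remaining items, and flags items for revision. The response matrix, the function names, and the cut-off values are all hypothetical illustrations, not the procedure or data of the AIR study.

```python
from math import sqrt


def pearson(u, v):
    """Product-moment correlation; returns 0.0 if either variable is constant."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    if suu == 0 or svv == 0:
        return 0.0
    return suv / sqrt(suu * svv)


def item_statistics(responses):
    """responses: one list of 0/1 item scores per examinee.

    Returns a (proportion passing, item-rest correlation) pair per item.
    """
    stats = []
    for j in range(len(responses[0])):
        item = [r[j] for r in responses]
        rest = [sum(r) - r[j] for r in responses]   # total on the other items
        stats.append((sum(item) / len(item), pearson(item, rest)))
    return stats


def flag_for_revision(stats, min_p=0.2, max_p=0.9, min_r=0.2):
    """Indices of items that are too easy, too hard, or weakly discriminating.

    The three cut-offs are illustrative assumptions only.
    """
    return [j for j, (p, r) in enumerate(stats)
            if not (min_p <= p <= max_p) or r < min_r]


# Hypothetical tryout: six examinees, four trial questions.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
flagged = flag_for_revision(item_statistics(responses))
print(flagged)
```

Flagged questions would be rewritten and tried again on a fresh sample, which is what makes the process a continuing one.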

2. When the environmental supports are known but inadequate. 
In this situation, a deficiency in certain aspects of the environment 
attenuates the value of measurements which depend on these aspects 
for support. The deficiency is known, but, because the environment 
cannot be readily manipulated, the difficulties remain. 

In trying to predict scholastic success, for example, one naturally 
considers such tests as vocabulary, reading comprehension, and 
arithmetic reasoning, which have consistently high validities for 
academic endeavors. They are in part measures of attainment, 
showing how the examinee has performed in his past school efforts; 
and in part aptitude tests, indicating how well he can apply learned 
skills to more advanced situations. To infer individual differences in 
ability from such measures, roughly equal opportunities for learning 
must be assumed. 

At this stage of African development, such equal opportunities do
not yet exist. There are huge differences in the quality of schools,
especially at the primary level, with corresponding differences in
the standards attained. The very able youngster from a very “bush”
school can not compete effectively on the typical academic predictors,
so that their value for secondary school selection is essentially
lost. And it will be some years before the large inter-school
differences will have dropped to a manageable level.

² This study was done as a collaborative effort with Mr. ... of
Western Michigan University during his tour as an adviser ...
College in Ibadan, Nigeria.

Under these conditions, it should be recognized that there can be 
no better than compromise solutions. One possibility would be to 
rely more heavily on measures of general ability (i.e., non-verbal 
"intelligence" tests) at the primary level, and to use the academic 
predictors for only the higher stages of education. A more satisfy- 
ing approach is to try to sample learning experiences that are rea- 
sonably effective in most schools, but that nevertheless lead to 
enough variability in performance to permit meaningful measure- 
ment. Both are being evaluated in studies now under way. 

For the quantitative skills, a test of simple computation is being 
tried in lieu of the more demanding reasoning tests. The hypothesis 
is that the four basic operations are taught and practiced in all 
schools, so that individual differences in performance may be pre- 
dictive of more general abilities in this field. To construct an ele- 
mentary test of language skills, a count of the frequency with which 
English words are used by Nigerian school children has been begun, 
based on a sample of one million words. The hope is that a core of 
suitable words will be found, and that these can be used as the 
elements of effective language tests. A non-verbal (“intelligence”) test
of concept formation was developed some time ago (Schwarz, 1961). 
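
The word count described above can be pictured with a short sketch. It tallies word frequencies over pooled samples of children's writing and then keeps as a "core" those words that occur in the samples from at least a given number of schools. The school labels, the sample sentences, and the rule of requiring a word to appear across schools are assumptions made for illustration; the text says only that a frequency count is under way.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z']+")


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return TOKEN.findall(text.lower())


def frequency_count(samples):
    """samples: dict mapping a school label to the pooled text written there."""
    total = Counter()
    for text in samples.values():
        total.update(tokenize(text))
    return total


def core_words(samples, min_schools):
    """Words that occur in the samples of at least min_schools schools."""
    spread = Counter()
    for text in samples.values():
        spread.update(set(tokenize(text)))   # count each school once per word
    return sorted(w for w, k in spread.items() if k >= min_schools)


# Hypothetical miniature corpus standing in for the one-million-word sample.
samples = {
    "school_a": "The farmer went to the market",
    "school_b": "My father went to the farm and the market",
    "school_c": "The market is far; we went by lorry",
}
counts = frequency_count(samples)
core = core_words(samples, min_schools=len(samples))
print(counts.most_common(3))
print(core)
```

Requiring a word to appear in every school is the strictest possible rule; a looser threshold would trade coverage against the risk of favoring the better schools.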

The characteristic of adaptations in this category, then, is to 
eliminate the standard operations, and to try to substitute correlated 
operations that have more adequate local supports. Although this 
may entail as much work as the above adaptations when the sup- 
ports were merely unknown, there is no guarantee of an effective 
solution. 

3. When the environmental supports are both adequate and 
known. This is the situation when the basic concepts required to 
perform the test operation have parallels in the daily lives of the 
examinees. Learning is reduced to the task of generalizing known 
principles to a new application, which is the task intended for apti- 
tude tests in all cultural settings. For operations in this category, 
Africa is functionally equivalent to the United States. 

Tryouts will show, perhaps unexpectedly, that most standard aptitude
tests fall into this group. Because such concepts as identity
(clerical accuracy, object identification, hidden figures), association
(symbol substitution), function (concept formation), and memory
are much used in the local culture, the corresponding operations
need not be changed. This fact may be obscured by inadequate 
pruning of the auxiliary operations, or by inappropriate methods 
of explanation—a topic discussed in the next section. But, when the 
basic operations are cast in a proper format, highly accurate meas- 
urements are consistently obtained (Schwarz, 1961). This makes 
possible the fairly rapid construction of enough instruments to begin 
a testing program while the adaptations of the above more demand- 
ing tests are still under way. 

The over-all conclusions with respect to adapting test operations, 
therefore, are that in all tests the number of such operations should 
be reduced to essentials; that in some tests adjustments will have 
to be made for clear-cut differences in environmental supports; but 
that in most tests the standard operations can be retained.


The Media of Communication 


Once the test operations have been established, decisions must be 
made about how these requirements are to be presented to the 
examinees. In general, the testing process affords two channels of 
communication. The first is in the initial explanation of the test 
procedures. Here, the test and the examiner must communicate to
each examinee precisely what it is that he is to do. The remaining 
inputs are supplied by the individual test exercises. Each item serves 
as a stimulus to which the examinee responds, and though his 
response may be wrong, the stimulus or “givens” at least must be 
clear. 

The initial explanation is critical in all kinds of testing situations. 
The inputs made by the items may be secondary, as in а repetitive 
coding or comparison task, where all needed information is supplied 
as part of the initial instructions. Or, also, these inputs may be 
critical, as in those tests (e.g., concept formation) that present some 
new data in each separate item.

For overseas applications, the normal means of communication 
must be adapted in both the instructions and the individual test 
problems.


The Nature of the Problem 


Language factors. The standard aptitude test relies chiefly on
printed and spoken instructions. It is entirely adequate, with American
examinees, to explain the task in one or two paragraphs,
illustrated with a few sample and practice problems; and to provide
additional data or signals by printing them in the appropriate places
on the test paper. But when, even in a country (e.g., Nigeria) that
uses a world language as the lingua franca, it can prove difficult to
communicate in this language with the teacher whose students are
about to be tested, drastic changes in the standard methods are
clearly required.

A first attempt might be to translate the instructions and signals 
into the local vernacular, and this was, in fact, tried. In one early 
study, the standard instructions were given in English to one stream 
of Nigerian sixth-graders, and in Yoruba to the second stream at 
the same school. The percentages of students who did not know what 
to do at the end of the instructions were, respectively, 50 and 40 
per cent, suggesting that the medium rather than the language was 
the limiting factor (Schwarz, 1961). 

These findings, coupled with the similar cautions of Biesheuvel 
(1952c) and Fontaine (1961), urge that all printed instructions and 
signals should be dropped from tests for the primary school level. 
Finding an appropriate substitute is the chief problem in adaptation. 

Perceptual factors. When a pictorial symbol has no intrinsic 
meaning (e.g., in a hidden figures test) or when its significance need 
not be understood to perform the test operation (e.g., in object com- 
parison items), the African examinee does not seem to be handi- 
capped to any appreciable degree. But, when the referent symbol is 
important, difficulties arise. Having had limited contact with pictorial
representation, African children will often fail to recognise
drawings of highly familiar objects for what they are (Biesheuvel,
1952c).

There has been much research (e.g., Fenseca & Kearl, 1960) on
the improvement of pictures used as teaching aids. But for the 
design of test exercises, not being able to use the normal conventions 
of drawing, such as perspective (Hudson, 1960) or shading (Guds- 
chinsky, 1959), is a severe limitation. There is considerable merit 
in Biesheuvel's (1956) suggestion that pictures should be avoided. 


Methods of Adaptation 


Test instructions. One effective approach is the use of filmed
pantomime instructions, as in the Adaptability Test (Biesheuvel,
1952a, 1954). This insures uniform explanations irrespective of the
skill of the examiners, and is especially useful for administrations 
to mixed-language groups. Because black-and-white productions 
have proved as effective as color films (Hudson, 1958), some econ- 
omies can be realized in this respect. But projection does require 
electric power, which is not yet available in many locations. And, 
some evidence suggests that the pantomime technique is not well 
suited to the explanation of more advanced tasks (Loveland, 1953). 

An alternative method is to devise instructions that rely mainly 
on visual aids, articulated models, and demonstration (Schwarz, 
1961). A spoken commentary is used to link these elements, but is 
supplementary to the more dynamic techniques. There is much 
emphasis on practice problems, both to enable overlearning and to 
provide opportunities for obtaining feedback on the comprehension 
of individual examinees. Although it does require up to three days of 
examiner training and an average of about 25 minutes per test, this 
method is effective with groups as large as 150-200 examinees at 
one sitting. 

In the development of instructions for either method, extensive 
pre-tryouts are required. Oral supplements must be developed practically
one phrase at a time, and modified for each different cultural
setting. The preliminaries as well as the execution are demanding.
But, at least on the basis of experience in Africa, static
techniques fail. 

Test exercises. Most of the standard aptitude tests require neither 
pictures nor words as part of the individual test items. They can and
should use symbols whose denotations (if any) need not be under- 
stood. 

But, sometimes, the use of pictures must be attempted. A few test
operations that are better suited to apparatus procedures can be
reduced to the pencil-and-paper format only through the use of
pictorial representation. Some others involve concepts or situations
that could be described with words, but suggest a pictorial approach
as the lesser evil.³ As a result of the perceptual problems enumerated
above, extensive adaptations may be required.

One approach is to try to structure the task so that a small set
of pictures can be used, perhaps in different combinations, to represent
all of the items. When this can be done, correct interpretation
can then be insured by including specific training on these particular
pictures in the initial instructions.

³ It should be noted that at the higher levels of education the opposite
may be true.

An example of this technique is the adaptation of a test of three- 
dimensional visualization. The standard task is to visualize the 
appearance of a given pattern after it has been folded in designated 
places, and to identify the resulting solid among four or five drawn
representations. This many Africans can not do. Analysis of the 
possible explanations suggested that the difficulty might lie not in 
the basic folding operation but in the interpretation of the perspec- 
tive drawings that represent the alternative solutions. On this 
hypothesis, the test was revised so that each printed pattern would 
form one of two differently painted cubes, and so that the instruc- 
tions would center on solid models of these cubes, a pair of which 
are given to each examinee. Excellent measurements and validities 
were thereafter obtained (Schwarz, 1962). 

When it is not possible to represent all of the exercises with a 
small set of pictures, as in a test of concept formation, the approach 
is directed at maximizing the interpretability of the pictures with- 
out specific training. The first step is to assemble a large set of trial 
drawings, and to present these for identification, one at a time, to 
subjects typical of the intended examinee groups. By eliminating 
those drawings that are most often misunderstood, reasonably high 
—but not perfect—accuracy of recognition will be obtained from 
the resulting test. The remaining ambiguities will introduce variance 
irrelevant to the measurements intended, but may not preclude effec- 
tive use of the test for many applications. 
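
The screening of trial drawings described above amounts to computing a recognition rate for each drawing and dropping those that fall below some standard. A minimal sketch follows; the drawing names, the tryout records, and the 80 per cent cut-off are hypothetical, since the text gives no figures for this step.

```python
from collections import defaultdict


def recognition_rates(trials):
    """trials: iterable of (drawing, correctly_identified) pairs,
    one pair per presentation of a drawing to a tryout subject."""
    shown = defaultdict(int)
    named = defaultdict(int)
    for drawing, correct in trials:
        shown[drawing] += 1
        named[drawing] += int(correct)
    return {d: named[d] / shown[d] for d in shown}


def retain(trials, min_rate=0.8):
    """Drawings identified correctly by at least min_rate of the subjects.

    The cut-off is an illustrative assumption.
    """
    rates = recognition_rates(trials)
    return sorted(d for d, r in rates.items() if r >= min_rate)


# Hypothetical tryout: three drawings, each shown to five subjects.
trials = [
    ("hoe", True), ("hoe", True), ("hoe", True), ("hoe", True), ("hoe", False),
    ("bicycle", True), ("bicycle", True), ("bicycle", True),
    ("bicycle", True), ("bicycle", True),
    ("teapot", True), ("teapot", False), ("teapot", False),
    ("teapot", True), ("teapot", False),
]
rates = recognition_rates(trials)
kept = retain(trials)
print(rates)
print(kept)
```

Because the retained drawings are still recognized by fewer than all subjects, some irrelevant variance remains, as the text notes.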

If it seems important to reduce this error still further, a second
modification can be applied. This is to cue the correct identification 
of the pictures with inputs supplementary to those in the drawings 
themselves, e.g., to speak the name of each object as the item is 
attempted. The combined visual-auditory stimulus will usually 
produce nearly perfect levels of comprehension. Such supplemental 
cueing is especially practical when the test is externally paced for 
reasons discussed in the following section. 

In summary, the communication aspects of the testing process
require a number of cultural adaptations. It is necessary to make
drastic changes in the test instructions, relying mainly on dynamic
techniques; to check the adequacy of all other verbal and perceptual
stimuli; and to modify or supplement deficient elements, as described
above.


Individual Differences in Strategy 


The last aspect of the testing process to be considered is the 
judgmental factor that also shapes an examinee's test behavior. This 
emerges in the different tactics he may adopt in fulfilling the 
requirements of the test. He may decide to work slowly to avoid 
mistakes, or sacrifice accuracy for speed; to defer the more difficult 
problems, or puzzle over them a long time; to answer items in the 
order of presentation, or in other logical groupings. African ex- 
aminees, having had little or no prior experience with tests, show 
large individual differences in strategy, which can have a sizeable 
effect on the resulting scores. Steps must be taken to encourage a
more uniform approach.


Time Limits 


The most general source of variation, found with nearly all types 
of tests, is the budgeting of time within fixed limits. This is a new 
problem for the African examinee to which he will, unless guided, 
formulate an individual solution. 

Speed tests. Strategy can be most easily controlled in tests in 
which the items are so simple that speed is virtually the sole basis
for discrimination. This is probably because the concept of maxi- 
mum speed is sufficiently concrete to be taught as part of the initial 
instructions. The concept should be illustrated, first, in the ex- 
aminer’s demonstration, by having him work a fairly large sample 
of items at the intended pace; and, again, in the practice exercises, 
by making the time limits here proportionate to those given for the 
actual test. Following such tuition, nearly all of the examinees will 
work at the peak tempo desired.

Time-limit tests. When the items require skill or accuracy as well 
as speed, it will be found that uniform strategies can generally not 
be achieved. The standard verbal cues (e.g., “Work quickly but be 
careful not to make too many mistakes”) tend to confuse inexperienced
examinees, and the lack of tangible criteria precludes effective
specialized training. One partial solution, applicable to many of these
tests, is to do studies to determine the optimum weighting of errors, 
and to depend on this scoring formula rather than external controls. 
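
A study to determine the optimum weighting of errors could take many forms. One simple possibility, sketched here under stated assumptions, scores each examinee as rights minus w times wrongs and searches a grid of weights for the one whose scores correlate most highly with an external criterion. The scoring formula, the grid, and all of the data are hypothetical; the text does not describe the form the actual studies took.

```python
from math import sqrt


def pearson(u, v):
    """Product-moment correlation of two equally long lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


def corrected_scores(rights, wrongs, w):
    """Score = number right minus w times number wrong (assumed formula)."""
    return [r - w * e for r, e in zip(rights, wrongs)]


def best_error_weight(rights, wrongs, criterion, weights):
    """Return (weight, validity) for the weight giving the highest
    correlation between corrected scores and the criterion."""
    results = [(pearson(corrected_scores(rights, wrongs, w), criterion), w)
               for w in weights]
    best_r, best_w = max(results)
    return best_w, best_r


# Hypothetical data for six examinees.
rights = [30, 25, 40, 22, 35, 28]
wrongs = [2, 10, 15, 1, 4, 12]
criterion = [70, 40, 62, 55, 78, 45]   # e.g., later course marks
weights = [i / 4 for i in range(9)]    # 0.00, 0.25, ..., 2.00

best_w, best_r = best_error_weight(rights, wrongs, criterion, weights)
print(best_w, round(best_r, 3))
```

In practice the weight would be chosen on one sample and checked on another, since a weight fitted and evaluated on the same examinees will overstate the validity obtained.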


Because "test-wiseness" seems to be rather quickly acquired, an 
additional remedy is available whenever a battery of tests is given. 
If these kinds of tests are scheduled near the end of the session, the 
incidental learning from earlier tests eliminates at least the extreme 
strategy variations that would otherwise be obtained. 

Power tests. Tests that are based on difficult items and liberal 
limits of time provide more opportunities for different strategies, and 
therefore result in greater amounts of individual variation. There 
seem to be no easy ways of encouraging more thoughtful responses 
by the examinees who use only a fraction of the available time, or 
of speeding up those for whom exceptionally generous limits are not
enough. This is the main reason that the frequently used tests of 
abstract reasoning prove unsuitable for the sixth-grade level, and 
that adequate power tests are, in general, the most difficult to devise. 

One cumbersome but effective solution is to pace such tests externally,
by having the examiner ask the questions at fixed intervals.
The training value of this procedure is considerable, and will provide 
an important by-product if any paced tests are given first. 


Other Strategy Variations 


Many tests lead to strategy variations that are peculiar to the 
specific operation or format involved. Once these have been deter- 
mined by careful observation during tryout sessions, ad hoc solutions 
can normally be devised. Usually, lengthening the demonstration
and practice phases, and alerting the proctors to the needed correc- 
tions will suffice. Occasionally, the lay-out of the test will have to be 
changed to eliminate approaches other than the one intended. These 
difficulties are highly specific, and permit few generalizations. The 
critical step is finding out what the problems are through tryouts 
expressly designed for this purpose. 

With respect to strategy, then, the approach consists of special 
studies to determine the nature of individual variations, and the 
introduction of controls through appropriate test design, tuition, or
scoring, depending on the nature of the task. The proper sequencing 
of tasks in a test battery provides cumulative training on judg- 
mental factors, and is an especially productive means of control. 
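
The sequencing principle can be stated as a simple ordering rule: externally paced tests first, for their training value, and time-limit tests near the end of the session, after test-wiseness has been acquired. The sketch below applies such a rule. The test names are hypothetical, and the placement of pure speed tests in the middle is an assumption, since the text does not fix their position.

```python
# Assumed session positions: lower numbers are given earlier.
POSITION = {"paced": 0, "speed": 1, "time_limit": 2}


def sequence_battery(tests):
    """tests: list of (name, kind) pairs, kind being a key of POSITION.

    Returns the test names in session order. The sort is stable, so
    tests of the same kind keep the order in which they were listed.
    """
    ordered = sorted(tests, key=lambda t: POSITION[t[1]])
    return [name for name, _ in ordered]


# Hypothetical battery.
battery = [
    ("coding", "speed"),
    ("figure assembly", "time_limit"),
    ("concept formation", "paced"),
    ("checking", "speed"),
]
order = sequence_battery(battery)
print(order)
```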


Implications 


There is much still to be learned about the techniques of overseas
testing, and the approach described in this paper will no doubt be
elaborated and refined during the next few years. Even in its present
form, however, it does make possible the construction of instruments
effective for many problems to the solution of which no measurements
are now being applied. In Africa, it has led to the development
of 20 such tests, and to the successful testing of about 15,000
examinees.

Because the need for tests as guides to practical decisions and as 
tools enabling more basic research is urgent throughout the develop- 
ing nations, it is hoped that this effort will stimulate similar activity 
also outside the African setting. 


REFERENCES 
Biesheuvel, S. “Personnel Selection Tests for Africans.” South African
Journal of Science, XLIX (1952), 3-12. (a)

Biesheuvel, S. “The Occupational Abilities of Africans.” Optima, II
(1952), 18-22. (b)

Biesheuvel, S. “The Study of African Ability.” African Studies, XI
(1952), 105-117. (c)

Biesheuvel, S. “The Measurement of Occupational Aptitudes in a
Multi-racial Society.” Occupational Psychology, XXVIII (1954),
189-196.

Biesheuvel, S. “Aspects of Africa.” Lecture given on the Third Programme
of the B.B.C., April 19, 1956.

Biesheuvel, S. “Questionnaire on Psychological Tests: Comments on
Responses Received.” Paper presented at CCTA conference of
February 27-March 2, 1962.

Commission for Technical Cooperation in Africa. Meeting of Experts
for the Construction of Vocational Selection Tests. London:
March 16, 1960.

Commission for Technical Cooperation in Africa. Construction of
Inter-African Selection Tests to Be Used at the End of the Primary
or Middle Stages. Lagos: June 30, 1961.

Dewet, D. “Two Tests of Implement Manipulation.” Journal of the
National Institute for Personnel Research, VII (1957), 75-77.

Fenseca, L., and Kearl, B. “Comprehension of Pictorial Symbols: An
Experiment in Rural Brazil.” University of Wisconsin: Department
of Agricultural Journalism, 1960.

Fontaine, C. Utilisation Pratique des Methodes de la Psychotechnique
dans les Pays en Developpement. Tunis: Secretariat d'Etat
à la Sante Publique et aux Affaires Sociales, 1961.

Gudschinsky, S. “Recent Trends in Primer Construction.” Fundamental
and Adult Education, XI (1959), 67-96.

Hector, H. “Relationship between Paired Comparisons and Rankings
of 7-Squares Test Patterns.” Journal of the National Institute
for Personnel Research, VIII (1959), 65-66.

Hudson, W. “Colour vs. Monochrome in a Demonstration Film Used
to Administer Performance Tests for the Classification of African
Workers.” Journal of the National Institute for Personnel
Research, VII (1958), 128.




Hudson, W. “Pictorial Depth Perception in Sub-cultural Groups in 
Africa.” Journal of Social Psychology, LII (1960), 183-208. 
Loveland, E. *Driver Selection Tests for Non-English Speaking Na- 
tions." Washington: Personnel Research Branch, Department of 

the Army, 1953. 

Morin, J. “Une Etude Psychotechnique du Travailleur Marocain: 
І bee en du Test de Pliage de Fil de Fer.” Journal de Psy- 
chologie Normale et Pathologique, LII (1955), 182-196. 

Romier, P. “Le Recrutement de la Main d'Oeuvre Nord Africaine." 
Bulletin du Centre d'Etudes et Recherches Psychotechniques, 
VII (1958), 221-227. 

Schwarz, P. “Aptitude Tests for Use in the Developing Nations.” 
Pittsburgh: American Institute for Research, 1961. 

Schwarz, P. “Report of Progress in Selection for the Skilled Trades.” 
Lagos: American Institute for Research, 1962. 

Verhaegen, P. “Utilite Actuelle des Tests pour l'Etude Psychologique 
des Autochtones Congolais.” Revue de Psychologie Appliquee, 

NA (1956), 139—151. 

Wickert, Е. “Industrial Psychology in Africa.” American Psycholo- 

gist, XV (1960), 163-170. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 4, 1963


NEUTRAL ITEMS AS A MEASURE OF ACQUIESCENCE¹


ALLEN L. EDWARDS AND CAROL J. DIERS
University of Washington 


The influence of social desirability upon responses to items in the
Minnesota Multiphasic Personality Inventory (MMPI) has now 
been documented in a number of studies. It has been shown, for 
example, that scores on MMPI scales are correlated with Edwards’ 
(1957) Social Desirability (SD) scale to the degree to which the 
items in the scales are keyed for socially desirable or socially unde- 
sirable responses (Edwards, 1957, 1961; Edwards, Heathers, & 
Fordyce, 1960). It has also been found that the greater the intensity 
of the social desirability keying of a personality scale, the greater 
the correlation of the scale with the SD scale (Edwards & Walsh, 
1963). First factor loadings of MMPI scales have been found to 
be directly related to the magnitude of the correlation of the scale 
with the SD scale (Edwards & Heathers, 1962; Edwards & Diers,
1962; Edwards, Diers, & Walker, 1962). The internal consistency of 
an MMPI scale, as measured by the Kuder-Richardson Formula 21 
(KR-21) lower bound estimate, has been found to be related to the 
imbalance in the social desirability keying of the scale (Edwards, 
Walsh, & Diers, 1963). Scales consistently keyed for either
socially desirable or socially undesirable responses tend to have a
greater degree of internal consistency than scales which have a good
degree of balance in their social desirability keying.

Another response set which is believed to be operating with respect
to responses to MMPI items is acquiescence, or the tendency to
respond True to an MMPI item (Couch & Keniston, 1960; Edwards,
1961; Edwards, Diers, & Walker, 1962; Finney, 1961; Jackson &
Messick, 1958, 1961; Messick & Jackson, 1961). In theory, if
a good measure of acquiescence could be developed and if responses
to MMPI items are influenced by acquiescent tendencies, then one
might expect to find various relationships between the measure of
acquiescence and MMPI scales such as those found between the
MMPI scales and the SD scale. Unfortunately, there is evidence to
indicate that no one of the various MMPI measures of acquiescence
which have been developed is independent of social desirability
influences.

¹ This study was supported in part by Research Grant M-4075 from the
National Institute of Mental Health, United States Public Health Service.

Fricke (1956) developed a response bias (B) scale using MMPI 
items of high controversiality, i.e., items for which the probability 
of a True response is approximately equal to the probability of a
False response, as a measure of acquiescence. Edwards (1961) found, 
however, that the B scale correlated -.59 with his SD scale and, in
a factor analysis of MMPI scales, Edwards, Diers, and Walker 
(1962) found that the B scale had a loading of -.56 on the first
factor of the MMPI, a factor which they interpret as a social de- 
sirability factor. Hanley (1961) has also shown that the B scale is 
influenced by social desirability considerations. 

It has been suggested by Hanley (1956) and by Edwards (1957) 
that scales containing only neutral items might be relatively inde- 
pendent of the influence of social desirability tendencies and there- 
fore more susceptible to the influence of acquiescent tendencies. 
However, Diers (1961) found that two experimental MMPI scales 
containing only neutral items were correlated with the SD scale. 
Similar results have been obtained by Crowder (1962). In another 
study, Edwards (in press) found that an experimental scale of 
neutral non-MMPI items was also correlated with the SD scale. 
The items in this scale, in addition to having neutral social
desirability scale values, were also of high controversiality.

That the correlations found between neutral scales and the SD
scale cannot be regarded as resulting from the imbalance in the
True-False keying of the SD scale is shown by the fact that neutral
scales in which all items are keyed for either True or for False
responses have practically zero correlations with social desirability
scales in which the items are also all keyed for either True or for
False responses, provided the items in the social desirability scale
all have high social desirability scale values (Diers, 1961; Edwards,
in press). Thus, if neutral scales are measuring acquiescent
tendencies, this response set is not reflected in any systematic way
in responses to items with socially desirable scale values, but only
in responses to items with socially undesirable scale values.

Crowder (1962) has suggested that one possible explanation of the 
relationship found between responses to neutral items and items 
with socially undesirable scale values is that the neutral point on 
the social desirability continuum is different for high and low 
scorers on the SD scale. For example, if a set of items is rated for
social desirability by a group of High SD (HSD) subjects, the 
neutral point for the HSD subjects may fall to the right of the 
neutral point found for Low SD (LSD) subjects or for unselected 
subjects. Thus, if an item has a neutral scale value when rated by 
unselected subjects, it may have a socially undesirable scale value 
when rated by HSD subjects. In describing themselves with respect
to items which have neutral scale values, when judged by an unse- 
lected group of subjects, HSD subjects may perceive the items as 
falling below their own neutral point and thus be less likely to 
respond True to the items.

The present study was undertaken to determine whether HSD 
subjects do rate neutral items lower in social desirability scale value 
than LSD and unselected subjects and to investigate further the 
relationship between responses to neutral items in self-description
and scores on the SD scale. 


Method 


Social desirability scale values were available for a pool of 2,824 
experimental personality items (Edwards, in press). These items 
had been rated for social desirability by a group of male college 
students and also by a group of female college students. Social
desirability scale values were thus available for each sex group. On
the basis of the female judgments, a set of 229 items which had scale
values in the neutral interval, 4.5 to 5.5, on a 9-point social
desirability rating scale was selected. In addition to these 229
items, a set of 50 items ranging in approximately equal intervals
from 1.5 to 8.5 on the social desirability continuum was selected.
These 50 items were randomly mixed with the 229 neutral items and
printed in a test booklet which also included the 39 items in
Edwards' SD scale. These are the items which were administered to
female subjects in the present study. A comparable test booklet was
developed for male subjects. This test booklet contained 244 neutral
items, based upon the judgments of college males, 50 items
approximately equally spaced over the social desirability continuum,
and the 39 items in the SD scale.

At the first testing session subjects were given the appropriate
test booklet and were asked to describe themselves by marking
each item True or False. At a second session, approximately one
week later, the subjects were given the same test booklet and at
this time were asked to rate each of the items on a 9-point social
desirability scale following the standard instructions described by
Edwards (1957).

Complete results from both sessions were obtained from 93 males
and 87 females in attendance at the summer session of Central
Washington State College.² These subjects were somewhat older
than the usual college student, the mean age for the females being
approximately 34 years and the mean age for the males being
approximately 31 years. Each subject's score on the SD scale was
obtained and a distribution of these scores was made for each sex
group. On the basis of these distributions, approximately the highest
and lowest thirds were selected to form a High Social Desirability
(HSD) group and a Low Social Desirability (LSD) group for each
sex. Table 1 gives the number of subjects in each group, the range
of their scores on the SD scale, and the mean score on the SD scale
for each group.

Mean social desirability ratings were computed for all neutral
items for each of the four groups of subjects. The mean social
desirability ratings of the neutral items were used to construct four
scales for each sex group. Scale AB consisted of items which HSD
subjects rated above the neutral point, 5.0, and which LSD subjects
rated below the neutral point on the 9-point social desirability rating
scale. Scale AA consisted of items which both groups rated above
the neutral point. Scale BB consisted of items which both groups
rated below the neutral point. Scale BA consisted of items which
HSD subjects rated below the neutral point and LSD subjects rated
above the neutral point. If an item had a mean rating for either the
HSD or the LSD group below 4.0 or above 6.0 on the 9-point rating
scale, it was discarded. Table 2 gives the number of items contained
in each of the four scales for each sex group. The total number of
True responses to the items in each of the four scales was then
obtained for each subject.

² Testing of the subjects was accomplished through the cooperation of
Dr. Eldon E. Jacobsen of Central Washington State College.


TABLE 1

Number of Subjects, Range of Scores on the SD Scale, and Mean Score
on the SD Scale for Each of the HSD and LSD Groups

Group            n     Range     Mean
Females  HSD    30     33-39    35.40
         LSD    29     15-28    23.72
Males    HSD    29     36-39    37.17
         LSD    32     22-32    29.75


TABLE 2

Distribution of Neutral Items in the Four Scales for Each Sex Group

                       Females               Males
                    Low SD rating        Low SD rating
High SD rating        B       A            B       A
      A              18     105           11      85
      B              48      26           81      47
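
The sorting rule just described can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' own procedure: the names `classify_item` and `scale_score` are hypothetical, and treating a mean rating of exactly 5.0 as unclassifiable is an assumption, since the text does not say how such cases were handled.

```python
from typing import Dict, List, Optional

NEUTRAL_POINT = 5.0      # midpoint of the 9-point rating scale
LOWER_LIMIT = 4.0        # items rated below this by either group are discarded
UPPER_LIMIT = 6.0        # items rated above this by either group are discarded


def classify_item(hsd_mean: float, lsd_mean: float) -> Optional[str]:
    """Return 'AB', 'AA', 'BB' or 'BA', or None when the item is dropped.

    The first letter refers to the HSD group's mean rating and the second
    to the LSD group's: A = above the neutral point, B = below it.
    Assumption: a mean exactly at the neutral point is not classified.
    """
    for mean in (hsd_mean, lsd_mean):
        if mean < LOWER_LIMIT or mean > UPPER_LIMIT:
            return None
        if mean == NEUTRAL_POINT:
            return None
    first = "A" if hsd_mean > NEUTRAL_POINT else "B"
    second = "A" if lsd_mean > NEUTRAL_POINT else "B"
    return first + second


def scale_score(responses: Dict[str, bool], scale_items: List[str]) -> int:
    """Count the True responses one subject gave to the items of one scale."""
    return sum(1 for item in scale_items if responses.get(item) is True)
```

Applied to the mean ratings later reported in Table 4 for females (for example 5.31 for HSD and 4.80 for LSD), `classify_item` returns the label of the scale those means belong to, here `AB`.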

Social desirability scale values, based upon the ratings of each
group, were also obtained for each of the 50 items which had been
included because they were approximately equally spaced over the
9-point social desirability continuum. In addition, for each of these
items the probability that it was answered True was obtained for
each group. Intercorrelations of the scale values and probabilities of
endorsement were then obtained for each group of subjects, including
the college group which originally rated the items and whose
ratings were used as a basis for selecting the items.


Results and Discussion 


Table 3 gives the mean number of True responses for HSD and
LSD subjects on each of the four scales along with the standard
deviations of the scales. In all cases, LSD subjects give more True
responses to the items in each of the four scales than do HSD
subjects. Table 4 gives the mean social desirability rating of the items
in each of the four scales and also for the combined scales for HSD
and LSD subjects. The difference in the over-all mean social
desirability rating of the items for HSD and LSD subjects is not
great and this result provides evidence that the differential True
response rates to neutral items for HSD and LSD subjects do not
occur because HSD subjects rate neutral items as less socially
desirable than LSD subjects rate them.


TABLE 3

Mean Number of True Responses and Standard Deviations on the Four
Neutral Scales for Each of the Experimental Groups

                    Females                           Males
            High SD         Low SD           High SD         Low SD
Scale     Mean     s      Mean     s       Mean     s      Mean     s
AB        8.70   3.46     9.34   2.47      4.10   1.74     5.16    [?]
AA       52.03   9.52    56.90*  8.23     45.17   9.36    48.53    [?]
BB       18.13   5.11    22.41*  3.47     24.55   9.12    34.03*   [?]
BA        9.36   3.87    13.79   3.76     16.76   6.96    23.22*   [?]

* Significantly different at P < .05.

TABLE 4

Mean Social Desirability Ratings of the Items in the Four
Neutral Scales and for the Combined Scales

              Females            Males
Scale       HSD     LSD        HSD     LSD
AB         5.31    4.80       5.17    4.83
AA         5.39    5.42       5.34    5.47
BB         4.66    4.59       4.55    4.67
BA         4.74    5.28       4.69    5.27


Table 5 gives the intercorrelations between probability of
endorsement and social desirability scale value for the set of 50 items
which are spaced over the social desirability continuum. The
correlations are given for both the HSD and LSD groups and also for
the college group which was used as a basis for selecting the items.
Again, it may be noted that there is little difference between the
mean social desirability rating of this set of 50 items for HSD and
LSD subjects. Furthermore, the intercorrelations of the social
desirability ratings of the various groups are quite high. For females,
the LSD correlation between probability of endorsement and social
desirability scale value is considerably lower (r = .81) than it is for
HSD subjects (r = .93). The difference between the correlations for
LSD males and HSD males is not as great and this is undoubtedly
because the LSD males are much closer in average score (29.75) on
the SD scale to the average score (37.17) of the HSD males.


TABLE 5

Intercorrelations of Social Desirability Ratings and Probabilities of
Endorsement P(T) for HSD and LSD Groups and for the College Group
Based upon a Set of 50 Items Spaced over the Social Desirability Continuum

                              Females
                        2      3      4      5      6     Mean     s
SD       1 College     .95    .94   1.00    .92    [?]    [?]     [?]
Ratings  2 High SD            .98    .95    .93    .81    5.20    [?]
         3 Low SD                    .94    .93    .81    5.22    1.69
         4 College                          .93    .80    0.48    0.30
P(T)     5 High SD                                 .81    0.50    0.32
         6 Low SD                                         0.54    0.25

                              Males
                        2      3      4      5      6     Mean     s
SD       1 College     .97    .98   1.00    .92    [?]    [?]     [?]
Ratings  2 High SD            .98    .97    .92    .86    4.89    1.56
         3 Low SD                    .97    .92    .88    5.13    [?]
         4 College                          .92    .87    0.46    0.28
P(T)     5 High SD                                 .87    0.42    0.31
         6 Low SD                                         0.48    0.27

The results reported above show that HSD subjects give fewer
True responses to neutral items than LSD subjects, but there is
little evidence to support the notion that they do so because they
rate neutral items as less socially desirable than unselected or LSD
subjects. An alternative explanation for the differential True
response rates to neutral items for HSD and LSD subjects is to be
found in the difference in the regression lines of probability of
endorsement on social desirability scale value for the HSD and
LSD groups. The regression equation of probability of endorsement
on social desirability scale value was obtained for the female LSD
group and also for the HSD group. These two regression equations
were then used to find the point on the social desirability continuum
at which the LSD probability of a True response was equal to the
HSD probability of a True response. The point of equality was 5.8,
whereas the neutral point on the social desirability continuum is 5.0.

For all items with scale values to the left of 5.8 on the social
desirability continuum, the LSD group has an expected higher
probability of a True response than does the HSD group, whereas
for items to the right of 5.8 the HSD group has a higher expected
probability of a True response than does the LSD group. The
corresponding point on the social desirability continuum for LSD
males and HSD males is 5.7. Now, since the neutral interval is
from 4.5 to 5.5, it is clear that if a set of items with neutral scale
values is selected for inclusion in a scale to measure acquiescent
tendencies, then, in general, LSD subjects will give more True
responses to these items and therefore have higher scores on this
scale than HSD subjects. We would thus expect True scores on any
such set of neutral items to be negatively correlated with scores on
the SD scale.
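
The point of equal probability described above is simply the intersection of two straight regression lines. The following sketch shows one way to compute it, assuming ordinary least-squares lines of P(True) on scale value; the function names `fit_line` and `crossover` are hypothetical, and the article does not report the fitted coefficients, so any coefficients supplied to these functions are illustrative only.

```python
from typing import Sequence, Tuple

Line = Tuple[float, float]  # (intercept, slope)


def fit_line(x: Sequence[float], y: Sequence[float]) -> Line:
    """Least-squares regression of y on x, returned as (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def crossover(line_a: Line, line_b: Line) -> float:
    """Scale value at which the two lines predict the same P(True)."""
    (a0, a1), (b0, b1) = line_a, line_b
    if a1 == b1:
        raise ValueError("parallel regression lines never intersect")
    # Solve a0 + a1 * x = b0 + b1 * x for x.
    return (b0 - a0) / (a1 - b1)
```

As an invented example, an HSD line with intercept -0.20 and slope 0.13 and an LSD line with intercept 0.09 and slope 0.08 intersect at a scale value of about 5.8. Below that value the flatter LSD line gives the higher predicted probability of a True response, which is the pattern the text describes.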

As a demonstration of the correlations to be found between scores
on neutral scales and scores on the SD scale, the scores of all 87
females on the four neutral scales and the scores of all 93 males on
the four neutral scales were correlated with scores on the SD scale.
These correlations are given in Table 6, along with the correlation
of the total number of True responses on all four scales with the
SD scale. For both females and males Scale AA has a lower
correlation with the SD scale than Scale BB. Scale AA contains items
with scale values above the neutral point for both groups and Scale
BB contains items with scale values below the neutral point for both
groups. Thus, the mean social desirability rating of the items in
Scale AA is closer to the point of predicted equal probability of a
True response than the mean social desirability rating of the items
in Scale BB. In other words, the two regression lines are farther
apart at the point on the social desirability continuum represented
by the mean social desirability rating of the items in Scale BB and
closer together at the point represented by the mean social
desirability rating of the items in Scale AA. Thus, for the items in
Scale BB the difference between the expected probability of a True
response for LSD subjects and HSD subjects is greater than it is for
the items contained in Scale AA, and Scale BB correspondingly has
a higher correlation with the SD scale than Scale AA.


TABLE 6

Correlations of Scores on the Four Neutral Scales and of the Total Score
on all Four Neutral Scales with Scores on the SD Scale

Scale      Females     Males
           n = 87      n = 93
AB          -.12        -.29
AA          -.24        -.27
BB          -.38        -.52
BA          -.46        -.54
Total       -.41        -.48

The results described above show that if one attempts to develop
a scale to measure acquiescent tendencies by selecting items with
neutral scale values for inclusion in the scale, then scores on the
scale may be expected to correlate negatively with scores on the
SD scale. However, the data suggest that the number of True
responses to a set of items with scale values falling between
approximately 5.7 and 5.8 on the social desirability continuum would be
relatively uncorrelated with scores on the SD scale, since it is at
this point on the social desirability continuum that the probability
of a True response for HSD and LSD subjects may be expected to
be approximately equal.

From the pool of 2,824 experimental personality items an attempt
was made to select a set of items which would have scale values
close to 5.75 on the social desirability continuum, based upon the
combined judgments of male and female college students. The value
5.75 was taken as a compromise between the point of expected
equal probability (5.8) of a True response for females and the
point of expected equal probability (5.7) of a True response for
males. The set of 40 items thus selected had a mean rating of 5.79
for the combined group of judges, a mean rating of 5.69 for females,
and a mean rating of 5.89 for males. Thus, the mean social
desirability rating of the items in this scale is slightly above the point
of expected equal probability of a True response for males and
slightly below the corresponding point for females.

The number of True responses to the set of 40 items was obtained
for a sample of 110 male college students and this score was then
correlated with scores of the same subjects on the SD scale. The
resulting correlation coefficient was .088. The mean score on the
40-item experimental scale was 24.16 and the standard deviation
was 4.41. The KR-21 lower bound estimate of internal consistency
was .52. It is, of course, not known whether the number of True
responses given to the items in the scale is, in fact, a measure of
acquiescence or not. It is, however, a scale in which the tendency to
respond True is relatively uncorrelated with the tendency to give
socially desirable responses as measured by the SD scale.

In the case of any new technique, the results obtained the first 
time the technique is applied may be purely accidental. As a check 
upon the applicability of the technique used in deriving the experi- 
mental scale, Gocka and Holloway replicated this portion of the 
present study using a sample of male VA patients and MMPI items.³
They selected 51 MMPI items with scale values approximately 
equally spaced over the social desirability continuum. The regres- 
sion equations of probability of a True response on social desir- 
ability scale value for this set of 51 items were then obtained for a 
HSD and a LSD group. The point of expected equal probability of 
a True response was 6.0. They then selected a set of 40 MMPI 
items with scale values of approximately 6.0 and correlated the 
number of True responses on this experimental scale with scores on 
the SD scale. For a sample of 220 male patients the resulting corre- 
lation coefficient was .086. The mean score on the 40-item experi- 
mental scale was 22.08 and the standard deviation was 6.33. The 
KR-21 lower bound estimate of internal consistency of the experi- 
mental scale was .77. Thus, the technique described in the present 
study for developing a scale in which all of the items are keyed 
True and such that scores on the scale are relatively uncorrelated 
with scores on the SD scale appears to have some generality. 
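
The two KR-21 estimates reported above can be checked from the published means and standard deviations alone. The sketch below assumes the usual form of Kuder-Richardson Formula 21, KR-21 = [k / (k - 1)] [1 - M(k - M) / (k s²)], where k is the number of items, M the mean score, and s the standard deviation; the function name `kr21` is illustrative.

```python
def kr21(n_items: int, mean: float, sd: float) -> float:
    """Kuder-Richardson Formula 21 estimate of internal consistency.

    Uses only the number of items, the mean score and the standard
    deviation, which is why it can be recomputed from summary statistics.
    """
    k = n_items
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * sd ** 2))


# Summary statistics quoted in the text for the two 40-item scales.
college_scale = kr21(40, 24.16, 4.41)   # college sample of 110 males
patient_scale = kr21(40, 22.08, 6.33)   # VA sample of 220 male patients
```

Rounded to two decimals these give .52 and .77, in agreement with the values reported for the college and VA samples.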


Summary 


Male and female subjects were asked to describe themselves in 
terms of a large pool of personality items with neutral social desir-
ability scale values, 50 items which were spaced over the social 
desirability continuum, and the 39 items contained in Edwards’ SD 
scale. Approximately one week later the same subjects rated the 
same items on a 9-point social desirability scale. High (HSD) and 
Low (LSD) groups were selected on the basis of their scores on the 
SD scale. 

The HSD group gave fewer True responses to neutral items than
the LSD group. For the complete sample of 87 females the
correlation between the number of True responses and score on the SD
scale was -.41 and the corresponding correlation for the complete
sample of 93 males was -.48.

³ E. E. Gocka and Hildegund A. Holloway, personal communication, 1962.

The social desirability ratings assigned to the pool of neutral 
items by HSD and LSD groups did not differ greatly on the average. 
This finding is interpreted as supporting the notion that the
differential True response rates to neutral items by HSD and LSD groups
are not the result of a difference between the groups with respect to
where they locate the neutral point on the social desirability 
continuum. 

The regression equation of probability of endorsement on social 
desirability scale value, based upon a set of 50 items spaced over 
the social desirability continuum, was obtained for both the LSD 
and HSD group. These two regression equations were then used to 
find the point of predicted equal probability of a True response. 
This point was found to fall outside the neutral interval on the social 
desirability continuum. A new scale was developed in which all of 
the items contained scale values at approximately the point of 
predicted equal probability of a True response. The number of True 
responses to the items in this scale was found to correlate only .088 
with scores on the SD scale in a new sample of 110 males. Using 
the same technique, Gocka and Holloway developed a scale using 
MMPI items and male patients at a VA hospital. The number of 
True responses to the items in the MMPI scale correlated .086 with 
Scores on the SD scale in a sample of 220 male patients. 


REFERENCES 


Couch, A. and Keniston, K. “Yeasayers and Naysayers: Agreeing
Response Set as a Personality Variable.” Journal of Abnormal
and Social Psychology, LX (1960), 151-174.

Crowder, Patricia. “An Empirical Investigation of the Effects of
Varying Initial Rates of True Responses on Subsequent Rates.”
Unpublished Master's thesis, University of Washington, 1962.

Diers, Carol J. “Social Desirability and Acquiescence in Response to
Personality Items.” Unpublished Ph.D. thesis, University of
Washington, 1961.

Edwards, A. L. The Social Desirability Variable in Personality
Assessment and Research. New York: Dryden Press, 1957.

Edwards, A. L. “Social Desirability or Acquiescence in the MMPI?
A Case Study with the SD Scale.” Journal of Abnormal and
Social Psychology, LXIII (1961), 351-359.

Edwards, A. L. “A Factor Analysis of Experimental Social
Desirability and Response Set Scales.” Journal of Applied Psychology,
in press.

Edwards, A. L. and Diers, Carol J. “Social Desirability and the
Factorial Interpretation of the MMPI.” EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT, XXII (1962), 501-509.

Edwards, A. L., Diers, Carol J., and Walker, J. N. “Response Sets
and Factor Loadings on Sixty-One Personality Scales.” Journal
of Applied Psychology, XLVI (1962), 220-225.

Edwards, A. L. and Heathers, Louise B. “The First Factor of the
MMPI: Social Desirability or Ego Strength?” Journal of
Consulting Psychology, XXVI (1962), 99-100.

Edwards, A. L., Heathers, Louise B., and Fordyce, W. E.
“Correlations of New MMPI Scales with Edwards' SD Scale.” Journal of
Clinical Psychology, XVI (1960), 26-29.

Edwards, A. L. and Walsh, J. A. “The Relationship between the
Intensity of the Social Desirability Keying of a Scale and the
Correlation of the Scale with Edwards' SD Scale and the First
Factor Loadings of the Scale.” Journal of Clinical Psychology,
XIX (1963), 200-203.

Edwards, A. L., Walsh, J. A., and Diers, Carol J. “The Relationship
between Social Desirability and Internal Consistency of
Personality Scales.” Journal of Applied Psychology, XLVII (1963).

Finney, J. C. “The MMPI as a Measure of Character Structure as
Revealed by Factor Analysis.” Journal of Consulting Psychology,
XXV (1961), 327-336.

Fricke, B. G. “Response Set as a Suppressor Variable in the OAIS
and MMPI.” Journal of Consulting Psychology, XX (1956).

Hanley, C. “Social Desirability and Responses to Items from Three
MMPI Scales: D, Sc, and K.” Journal of Applied Psychology,
XL (1956), 324-358.

Hanley, C. “Social Desirability and Response Bias in the MMPI.”
Journal of Consulting Psychology, XXV (1961), 13-20.

Jackson, D. N. and Messick, S. “Content and Style in Personality
Assessment.” Psychological Bulletin, LV (1958), 243-252.

Jackson, D. N. and Messick, S. “Acquiescence and Desirability as
Response Determinants on the MMPI.” EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT, XXI (1961), 771-790.

Messick, S. and Jackson, D. N. “Acquiescence and the Factorial
Interpretation of the MMPI.” Psychological Bulletin, LVIII
(1961), 299-304.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 4, 1963


GOOD IMPRESSION, SOCIAL DESIRABILITY, AND 
ACQUIESCENCE AS SUPPRESSOR VARIABLES 


CHARLES DICKEN¹
University of Chicago²


This study uses social desirability and acquiescence as suppressor
variables for the California Psychological Inventory (CPI) (Gough, 
1957). 

A suppressor variable is defined as one which is significantly
associated with a predictor, but essentially unassociated with the
criterion for which the predictor is valid. When these relationships
hold, validity can be improved by accounting for a portion of the
variance of the predictor which is not associated with the criterion
(Lubin, 1957; McNemar, 1945). Lubin expresses this increase in
validity in terms of the relationship of $r_{cv}$ to $R_{c \cdot vs}$,
where c is the criterion, v a valid predictor, s a suppressor variable,
and R the multiple correlation coefficient. He shows that when
$r_{cs}$ is approximately equal to zero, $r_{vs}$ must exceed .40 for
a predictive gain of 10 per cent. In the case of a suppressor variable,
the regression weight of s ($\beta_s$) is negative. If $r_{cs}$ is
positive, the suppression effect tends to be vitiated unless $r_{cs}$
is large enough to merit using s as simply another valid predictor
($\beta_s$ positive). Intermediate levels of $r_{cs}$ yield no
predictive gain. Lubin also distinguishes the “negative suppressor,” a
variable whose correlation with v is opposite in sign from its
validity.

¹ The author is indebted to Harrison Gough, Donald MacKinnon, and
Wallace Hall of the Institute of Personality Assessment and Research,
Berkeley, for furnishing the data on which this study was based.
Thanks are also due Donald Fiske, Lew Goldberg, Harrison Gough,
Ardie Lubin, and Jerry Wiggins, who read a preliminary version of
the report and furnished valuable comments and criticism.

² The project was initiated while the author was a Research Associate at
Stanford University.
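
The suppression relationships just summarized can be illustrated with the standard formulas for a two-predictor regression on standardized variables. This is a sketch under that assumption, not a reproduction of Lubin's derivation; the function names are hypothetical, and the arguments follow the notation of the text (c criterion, v valid predictor, s suppressor).

```python
import math
from typing import Tuple


def multiple_r(r_cv: float, r_cs: float, r_vs: float) -> float:
    """Multiple correlation of c with v and s (two-predictor case)."""
    r_squared = (r_cv ** 2 + r_cs ** 2 - 2 * r_cv * r_cs * r_vs) / (1 - r_vs ** 2)
    return math.sqrt(r_squared)


def beta_weights(r_cv: float, r_cs: float, r_vs: float) -> Tuple[float, float]:
    """Standardized regression weights (beta_v, beta_s)."""
    denom = 1 - r_vs ** 2
    return (r_cv - r_cs * r_vs) / denom, (r_cs - r_cv * r_vs) / denom


def min_r_vs_for_gain(gain: float) -> float:
    """Smallest r_vs giving a proportional gain R / r_cv - 1 when r_cs = 0.

    With r_cs = 0 the ratio R / r_cv reduces to 1 / sqrt(1 - r_vs**2).
    """
    return math.sqrt(1 - 1 / (1 + gain) ** 2)
```

Under these formulas a 10 per cent gain with $r_{cs} = 0$ requires $r_{vs}$ of roughly .42, in line with the statement that $r_{vs}$ must exceed .40, and the weight for s is negative whenever $r_{cs} = 0$ while $r_{cv}$ and $r_{vs}$ are both positive.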

The suppression paradigm prompts search for variables related to
criterion-irrelevant, “unwanted” predictor variance. Response sets 
or styles such as the “fake-good”-“fake-bad” dimension (Meehl & 
Hathaway, 1946), social desirability (Edwards, 1953, 1957b) and 
acquiescence (Cronbach, 1946; Fricke, 1956b; Jackson & Messick, 
1961) have often been cited as potential contributors of irrelevant 
variance to personality inventories. Edwards and his associates 
(Edwards, 1953, 1957b, 1962; Edwards & Walker, 1961) have in- 
terpreted positive correlations between item desirability and item 
endorsement frequency and correlations between personality scores 
and the scales’ or respondents’ desirability level as indicating that 
the social desirability factor is highly significant. Factorial studies 
of the MMPI (Fordyce, 1956; Jackson & Messick, 1961; Wiggins, 
1962) have been interpreted as indicating substantial components of 
both desirability and acquiescence variance. 

Some attention has been given to response sets as personality 
dimensions in their own right (Berg, 1955; Couch & Keniston,
1960; Meehl & Hathaway, 1946). Gough (1957) argues that em- 
pirical construction of self-report devices minimizes the problem of 
invalid response set variance since criterion-relevant aspects of re- 
sponse sets tend to be included and criterion-irrelevant components 
excluded. 

Whatever the magnitude and relevance of response set variance, 
there is little doubt that individual differences in set can be demon- 
strated and reliably measured. Wiggins (1962) cites 11 measures 
of desirability responding and seven measures of acquiescence. 
Social desirability scales have been rationally constructed from 
item desirability judgments and empirically constructed by con- 
trasting responses under standard and role-playing instructions. 
Acquiescence scales key all “true” items with heterogeneous or non- 
discriminating content or items selected for high “controversiality” 
(middle-range endorsement frequency). There is also little doubt of 
the power of measures of the desirability type in detecting records 
of Ss who deliberately misrepresent themselves (Dicken, 1960; 
Gough, 1947; Wiggins, 1959). 

If response sets contribute substantial amounts of invalid variance
to personality scores of typical assessment subjects, then measurement
of the set component and nullification of its effect by the suppression
technique should increase validity. The suppressor model is
implicit in the forced-choice social desirability control proposed by
Edwards (1957a). McKinley, Hathaway, and Meehl (1948) reported
increases in the concurrent validity of five MMPI clinical
scales by using K, an empirically derived favorability bias measure,
as a suppressor. However, later studies of MMPI scales with and
without the K-correction have shown no differences in corrected and
uncorrected scores (Tyler & Michaelis, 1953), insignificant gains in
validity (Hunt, Carp, Cass, Winder, & Cantor, 1948; Schmidt,
1948), mixed gains and decrements (Monachesi, 1953), or consistent
decrements in validity (Fulkerson, Freud, & Raynor, 1958). Suppressing
acquiescence yielded no gain in validity for MMPI Hy
(Fricke, 1956a) but did result in significant though small gains for
a teacher effectiveness scale (Fricke, 1956b) and an MMPI adjustment
key (Fulkerson, 1958).

The CPI is especially suitable for testing the effectiveness of
suppression of response set variance. It measures socially favorable
traits but lacks explicit control for desirability bias. Imbalance of 
true-false keying of some of the scales leaves it open to acquiescent 
bias. There is considerable validity data for the instrument in its 
present form (Gough, 1957, references revised, 1960). 


Development of Response Set Measures 
Good Impression (Gi) 

Gough (1952) asked high school Ss to respond to 115 specially 
written items under standard instructions and again under instruc- 
tions to create “an exceptionally favorable impression on an im- 
portant person . . . the best picture of yourself.” Significant endorse- 
ment frequency shifts occurred for 40 items, which were keyed for 
the Good Impression (Gi) scale in the direction of higher endorse- 
ment in the role-playing condition. Later studies (Gough, 1957; 
Dicken, 1960) indicate Gi is highly effective in identifying CPI
protocols consciously biased in a favorable direction. Since more
than three-fourths of the items on Gi are keyed false, it is possible
that acquiescent tendencies as well as favorability bias are reflected
(Jackson, 1960). Fricke (1956b) and Hanley (1956) contend that
MMPI K, keyed 29/30 false, reflects acquiescence as well as
favorability.


702 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


Social Desirability (SD) and Acquiescence (Acq)


A CPI key scorable for independent components of social desir- 
ability and acquiescence was derived according to Hanley’s (1957) 
rationale. Hanley reasoned that an ideal desirability scale would 
consist of items for which endorsement frequency (communality) 
and item desirability are negatively related for “honest” subjects, 
i.e., items of high desirability which are not endorsed by unbiased 
Ss. Biased Ss would score high, “honest” Ss low on such a key. 

Unfortunately, item desirability and item communality tend to be 
positively correlated for most items and subject groups (Edwards, 
1957b). Both “honest” Ss who in fact possess the traits to which 
desirable items refer, and “biased” Ss who falsely claim the same 
traits may earn relatively high desirability scores. This expectation 
is supported by Wiggins's (1962) finding of high average desirability 
scores and low variability on Edwards (1957b) SD scale, in which 
item communality and item desirability are highly correlated.
Wiggins's desirability key (Sd), composed of items with a low
desirability-communality correlation, showed greater variability and was
more effective in screening desirability role players.

Hanley’s procedure for selecting items with a low relationship 
between communality and desirability was used in the present study. 
A desirability scale so composed should yield high scores for desir- 
ability biased Ss, moderate scores in “honest” Ss, and low scores in 
“undesirability” biased Ss (malingerers). Hanley suggested that 
the usual high positive relationship of item desirability and item 
communality can be lowered by restricting the range of the latter. 
Items varying in desirability but falling in the middle endorsement 
frequency range (e.g., 36 per cent—64 per cent) are sought. Such 
an item pool also has properties suitable for an acquiescence key 
provided the desirability variable is counterbalanced. 

For most normative samples, roughly 25 per cent of the 480 CPI
items fall in the 36 per cent to 64 per cent communality range
(Gough, unpublished data). Communalities for Gough's high-school
male (N=527) and high-school female (N=510) normative samples
were used to identify 117 items which fall within the 36 per cent to
64 per cent range in both samples and which show no sex differences
greater than 15 percentage units. Two clinicians rated these
items for the social desirability of a “true” response. An excess of
potentially “true-undesirable” items in relation to “true-desirable”
items was reduced by random elimination from among the first
item type, leaving 91 items. These were submitted to 41 male and
45 female high-school student judges for desirability ratings.
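The screening step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the inclusive treatment of the 36 and 64 per cent bounds, and the five endorsement percentages are hypothetical and are not taken from Gough's normative data.

```python
def screen_items(male_pct, female_pct, low=36.0, high=64.0, max_sex_diff=15.0):
    """Return item numbers whose endorsement percentage lies in [low, high]
    in both samples and whose male-female difference is at most max_sex_diff.
    Treating the bounds as inclusive is an assumption of this sketch."""
    selected = []
    for item in sorted(male_pct.keys() & female_pct.keys()):
        m, f = male_pct[item], female_pct[item]
        in_range = low <= m <= high and low <= f <= high
        if in_range and abs(m - f) <= max_sex_diff:
            selected.append(item)
    return selected


# Hypothetical percentages of "true" responses for five made-up items.
male_pct = {1: 50.0, 2: 30.0, 3: 60.0, 4: 40.0, 5: 62.0}
female_pct = {1: 55.0, 2: 45.0, 3: 70.0, 4: 58.0, 5: 48.0}

selected_items = screen_items(male_pct, female_pct)
print(selected_items)
```

In the made-up data, items 2 and 3 fall outside the middle range in one sample and item 4 shows an 18-point sex difference, so only items 1 and 5 survive.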

Each item was rated separately by each judge on a 7-step scale 
from “true” socially undesirable to “true” socially desirable. “So- 
cially desirable" was defined as “what people in general or society 
believes would be a good thing for a person to say about himself or 
herself in answering questions like this." Median SD ratings were 
computed for each item. Sixty-one of the items fell in the "neutral" 
desirability range (medians 3.0 to 4.99). There were 16 items with 
medians in the range 1.0 to 2.99 (true-undesirable) and 14 items in 
the range 5.0 to 6.99 (true-desirable). Two items with medians between
4.9 and 5.0 were added to the true-desirable group, yielding a
32-item key (SD) with 16 items keyed true and 16 items keyed
false for desirability responding. Scale Acq consists of all 32 items
keyed “true,” half the Acq-keyed responses being “desirable,” the
other half “undesirable.” Six items, all keyed false, are common to
SD and Gi and common with inverse keying to Acq and Gi.
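A minimal sketch of how the two keys share one item pool may help. The item numbers are copied from the article's footnote as they appear in this scanned copy, so they should be treated as illustrative; the function names and the handling of omitted items (counted as not keyed) are assumptions of the sketch, not part of the published scoring rules.

```python
# Item numbers as printed in the footnote listing the SD-Acq key.
SD_KEYED_TRUE = frozenset({52, 95, 97, 108, 135, 140, 152, 168,
                           242, 246, 276, 347, 354, 380, 389, 473})
SD_KEYED_FALSE = frozenset({7, 44, 67, 70, 81, 101, 194, 219,
                            231, 233, 270, 298, 331, 335, 375, 462})
ALL_ITEMS = SD_KEYED_TRUE | SD_KEYED_FALSE


def score_sd(responses):
    """SD: one point per response in the socially desirable direction."""
    true_hits = sum(1 for i in SD_KEYED_TRUE if responses.get(i) is True)
    false_hits = sum(1 for i in SD_KEYED_FALSE if responses.get(i) is False)
    return true_hits + false_hits


def score_acq(responses):
    """Acq: one point per "true" response on any of the 32 items."""
    return sum(1 for i in ALL_ITEMS if responses.get(i) is True)


# Two hypothetical respondents: one answers "true" throughout, the other
# always answers in the desirable direction.
all_true = {i: True for i in ALL_ITEMS}
desirable = {i: (i in SD_KEYED_TRUE) for i in ALL_ITEMS}

print(score_sd(all_true), score_acq(all_true))
print(score_sd(desirable), score_acq(desirable))
```

Because desirability is counterbalanced across the items, the pure yea-sayer lands at the midpoint of SD and the pure desirability responder lands at the midpoint of Acq.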

The median desirability values for the 32 CPI items obtained 
from the high school male judges correlate .50 with endorsement 
frequencies from Gough’s normative high school males and .39 with 
endorsement frequency among 313 normative college males. Desira- 
bility medians from the high school female judges correlate .32 with 
endorsement frequency in Gough’s normative high school females 
and .60 with endorsement frequency in 375 normative college fe- 
males. These communality-desirability correlations are considerably 
lower than those reported for Edwards's (1957b) SD scale (r = .91),
indicating at least partial satisfaction of Hanley’s rationale. 

3 High-school student judges were used in view of the large high-school
samples available for testing the suppressor keys. The generality of the
judgments for other populations is open to question, although high agreement on
desirability ratings in diverse rater groups has been demonstrated (Edwards,
1957b; Klett & Yaukey, 1959). The similarity of the relationships of the Gi
and SD scales to each other and to the predictor and criterion measures in
both the high-school and non-high-school samples in the present study (Tables
2 and 3) also argues for generality of the judgments.

4 CPI item numbers (published version) for the SD-Acq scale and keying
direction for desirability are: True 52, 95, 97, 108, 135, 140, 152, 168, 242, 246,
276, 347, 354, 380, 389, 473; False 7, 44, 67, 70, 81, 101, 194, 219, 231, 233, 270,
298, 331, 335, 375, 462.


TABLE 1
Identification and Reliability Estimates of CPI, Criterion, and Suppressor Variables

[Table body not recoverable from the scanned page. According to the text, the
left half lists names, abbreviations, scale lengths, keying proportions, and
reliability estimates for the CPI and suppressor scales; the right half lists
the behavior rating criteria and their reliability estimates.]


Reliability estimates for SD and Acq appear at the lower left of
Table 1. Response set keys of the Hanley type would presumably
have zero internal consistency in a population composed entirely of 
Ss not prone to social desirability or acquiescence. Large internal 
consistencies would be expected in subject groups with wide individ- 
ual differences in desirability bias or acquiescence. Hanley's MMPI 
desirability key, Sx, which is logically similar to SD, has a KR 20
coefficient of .31. His AT, logically similar to Acq, has a KR 20
of .32. (These coefficients are .44 and .46 if adjusted for length to 
SD and Acq.) Low internal consistencies like these are what would 
be expected in populations with some Ss affected and some unaf- 
fected by response sets. SD is somewhat more internally consistent 
than Sx, which may reflect a greater communality-desirability asso- 
ciation or a higher proportion of biased respondents. The reliability 
estimates for Acq approximate those for Hanley’s AT. 
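The length adjustment mentioned above is presumably the Spearman-Brown formula, although the article does not name it. The sketch below assumes that formula and, purely for illustration, an 18-item length for each of Hanley's keys; the article does not state their length, and 18 is a hypothetical value chosen because it gives adjusted coefficients that round to the reported .44 and .46.

```python
def spearman_brown(r, k):
    """Reliability expected when a test with reliability r is lengthened
    by a factor k (Spearman-Brown prophecy formula)."""
    return k * r / (1.0 + (k - 1.0) * r)


# Hypothetical lengthening factor: from an assumed 18 items to the
# 32 items of SD and Acq.
k = 32 / 18

adjusted_sx = spearman_brown(0.31, k)  # Hanley's desirability key
adjusted_at = spearman_brown(0.32, k)  # Hanley's acquiescence key
print(round(adjusted_sx, 2), round(adjusted_at, 2))
```

Any other assumed length would change the adjusted values, so the sketch shows the form of the correction rather than a reconstruction of the author's arithmetic.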

The intercorrelations of Gi, SD, and Acq in six samples are shown
in Table 2. Item overlap determines a correlation of +.17 between
Gi and SD and a correlation of -.17 between Gi and Acq (common-elements
correlation, McNemar, 1949, pp. 117-118). Considering the
reliabilities, the two desirability measures correlate enough to indicate
that the empirical and rational construction procedures
give functionally similar measures. The Gi-Acq correlations suggest
a small acquiescence component in the former. Scales SD and Acq
are insignificantly correlated in these samples. The method of keying
these scales restricts their potential intercorrelation although
it does not require that it be zero. Dr. A. Lubin has pointed out to
the author the following relationship. “Let 16 true-item scores = t,
let 16 false-item scores = f. Then SD = t + f, ACQ = t + (16 - f).
The correlation between SD and ACQ must be zero if s_t = s_f. If
s_t > s_f, the correlation will be positive; if s_t < s_f the correlation
will be negative.”
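Both points in this paragraph can be checked numerically. The sketch below assumes the usual common-elements formula (shared items divided by the geometric mean of the two scale lengths, for equally weighted and otherwise uncorrelated items) and uses simulated part scores, which are hypothetical, to illustrate Lubin's relationship: the covariance of SD and ACQ equals the variance of t minus the variance of f.

```python
import math
import random
import statistics


def common_elements_r(n_common, n_a, n_b):
    """Correlation forced by item overlap alone, assuming equally weighted,
    otherwise uncorrelated items (common-elements formula)."""
    return n_common / math.sqrt(n_a * n_b)


def sample_cov(x, y):
    """Unbiased sample covariance of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


# Gi has 40 items, SD has 32, and 6 items are shared.
r_gi_sd = common_elements_r(6, 40, 32)

# Simulated part scores: t varies more widely than f (hypothetical data).
rng = random.Random(0)
t = [rng.randint(0, 16) for _ in range(200)]
f = [rng.randint(4, 12) for _ in range(200)]
sd = [ti + fi for ti, fi in zip(t, f)]
acq = [ti + (16 - fi) for ti, fi in zip(t, f)]

cov_sd_acq = sample_cov(sd, acq)
var_diff = statistics.variance(t) - statistics.variance(f)
print(round(r_gi_sd, 2), round(cov_sd_acq, 3), round(var_diff, 3))
```

The identity holds for any data, since the cross-product terms involving t and f cancel; the sign of the SD-ACQ correlation therefore depends only on which part score has the larger variance.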


TABLE 2
Intercorrelations of Suppressor Scales in Six Samples


Sample                     N      r_Gi-SD    r_Gi-Acq    r_SD-Acq

High School Males         120       74         -27         -02
High School Females       123       80         -25         -01
Medical Applicants         70       78         -16          14
Student Engineers          66       65         -36         -20
Research Scientists        45       65         -26      [illegible]
College Women         [row illegible in the scanned page]




Testing the Suppressor Paradigm 
Samples and Criteria 


The right-hand half of Table 1 summarizes the behavior rating 
criteria used in the six samples studied and shows rating reliability 
estimates from four of the samples. Names, abbreviations, scale 
lengths, keying proportions, and reliability estimates for CPI scales 
selected for pertinence to the criteria appear in the left half of the 
table. Lubin’s notation v (predictor), c (criterion), and s (suppressor)
is used here and in subsequent tables.

High-school cases. These were drawn from Gough’s (1957) CPI
cross-validation cases. Four schools are represented. The total samples
from these schools contain 519 males (HSM) and 543 females
(HSF). Selecting each third case from the junior-senior students in
each school gave samples of 120 males and 123 females on which all
r_vs (Pearsonian) correlations in the high-school data are based.

The trait criteria for the high-school cases are principals’ and
assistant principals’ nominations of students judged extremely high
and extremely low in each characteristic. The proportion of the
students nominated at each extreme varies from 8 per cent to 9
per cent, proportions to be nominated having been specified by
Gough. Thus from 82 per cent to 84 per cent of the sample falls in
the unnominated middle range with respect to each trait. Typically
a single rater nominated an S, so no inter-rater reliabilities are
available. Correlations involving the criterion in the high-school
samples (r_vc and r_cs) are biserial, computed by the wide-split formula,
which accounts for the omission of the middle cases (Peters &
Van Voorhis, 1940, p. 385; see also Dicken, 1961). For purposes of
the multiple regression analysis to follow, these biserial correlations
are treated as estimates of the Pearsonian correlation between the
test scores and a continuous criterion.

The analyses for the Dominance, Social Participation, and Im- 
pulsivity criteria are based on juniors and seniors only. There were 
32 high and 32 low nominees in each sex sample. The Social Pres- 
ence, Self-acceptance, and Responsibility criteria are based on 45 
high and 45 low males for each of the first two criteria, 44 high and 
44 low males for the last, and 44 high and 44 low females for each 
criterion. The .05 significance levels for r cited in Table 3 for the
high-school samples are based on Peters and Van Voorhis’s error
estimate for the widespread biserial with the smallest number of
cases. The .05 level for the Pearsonian value based on the 120 and
123 cases used in computing r_vs is the same (r = .17).

IPAR cases. Male medical applicants (MED), engineering stu- 
dents (ENG), research scientists (SCI), and female college students 
(CW) assessed at the Institute of Personality Assessment and Re- 
search, Berkeley, (IPAR) were also used. Criterion rating scores for 
the traits are the averages of ratings of independent observers. The 
subjects’ behavior was observed during comprehensive, two-day 
assessments. Procedures of the IPAR assessment program have been 
described in detail elsewhere (MacKinnon, Crutchfield, Barron, 
Block, Gough, & Harris, 1958). The subjects’ behavior in group
discussions, group problem-solving, laboratory and “stress” assessment
procedures, informal social interactions, etc., was observed by
the raters, who recorded their judgments at the close of the two-day
period. Ten to 15 Ss were assessed during any given session. The
MED rating scores are averages from seven independent raters. Ten
raters were used for the SCI sample and for 40 of the ENG cases,
15 raters for the remaining ENG cases, and 12 raters for the
CW sample. Ratings for the MED, ENG, and CW cases were made
with reference to a five-point, quasi-normal distribution and then 
averaged for each subject; the SCI cases were ranked in subsamples 
of 15 and average rating scores derived from the ranks. 

The trait definitions used by the IPAR raters appear below. 


Dominance. Personal ascendance in relations with others (reso- 
lute, self-assured, forceful, not easily intimidated, authoritative). 

Responsibility. Willingness to accept the consequences of one’s 
own behavior; dependability, trustworthiness, sense of obligation to 
the group. (This need not require the person to assume leadership 
or direction of group activity.) 

Impulsivity. Inadequate control of impulse; lacking in self- 
discipline: self-centered, quick-tempered, and explosive. 


Intellectual competence. The capacity to think, to reason, to
comprehend, and to know.

Rigidity. Inflexibility of thought and manner. Stubborn, pedantic,
unbending, firm.

Masculinity. Characteristically masculine in style and manner of
behavior; self-sufficient; not sentimental or romantic; strong.


TABLE 3
Good Impression (Gi) and Social Desirability (SD) as Suppressor Variables (s)

[Table body not recoverable from the scanned page. According to the text, it
reports, for each sample and each predictor-criterion pair, the correlations
among v, c, and s, the multiple correlation, the Gain, and the Mode,
separately for Gi and for SD.]




Cognitive Flexibility. The ability to shift and to adapt, and to 
deal with the new, the unexpected, and the unforeseen. 
Femininity. Not further defined. 


The IPAR rating reliabilities (Table 1) indicate substantial inter- 
judge agreement and relatively high stability for the composite 
ratings. The majority of the raters were psychologists highly experi- 
enced in assessment, and all raters were extensively pre-trained with 
the trait definitions. Ratings of favorable traits share common or
halo variance (MacKinnon et al., 1958). However, some degree of
differentiation or discriminant validity of the present ratings is suggested
by analyses of the MED, ENG, and SCI samples. The
median inter-trait, intra-sample criterion correlation for five traits
in these three samples is +.37, a value considerably lower than the
rating reliabilities.

CPI Scales 


Predictor scales Do, Sy, Re, S, Ie, and Fe were empirically constructed;
Sp, Sa, Sc, and Fx were constructed by rational item selection
and internal consistency analysis. The sign of Sc is reversed in
all analyses to correspond to the original scale (Impulsivity) and
the trait ratings. The signs of Fx and Fe are reflected in samples
MED, ENG, and SCI to correspond to the directionality of the
rigidity and masculinity ratings.


Correlation Analysis


All entries r in Tables 3 and 4 are Pearsonian correlations with
the exception of the high-school r_vc and r_cs values mentioned. The
predictor-criterion relationships are exhaustive, i.e., all logically
related CPI-rating criterion dyads available for each sample are
included. Three instances of zero predictor-criterion validity are
shown but left unanalyzed with respect to the suppressor paradigm.
Some instances of very low validities are included for completeness
although the potential predictive gains are negligible. Acquiescence
was investigated as a suppressor variable (Table 4) in only those
instances where the predictor scales’ keying proportions substantially
deviate from .50.

TABLE 4
Acquiescence (Acq) as a Suppressor (s)


Sample   v    c           r_vc   r_cs   r_vs   R_c.vs   Gain   Mode
SCI*     Re   Respon       38     37     13      49     10*    P
HSM      Sc   Impuls       18     01     36      19     00     VS
HSF      Sc   Impuls       14     04     32      14     00     VC(VS)
MED      Sc   Impuls       22     09     36      22     00     VC(VS)
SCI      Sc   Impuls       28    -02     48      33     08     VC(S)
MED      Ie   Int Comp     34     00    -07      34     00     VS
ENG      Ie   Int Comp     32    -03     16      33     01     VS
SCI      Ie   Int Comp     41     06     03      41     00     VS
MED      Fx   Rigid        12    -06     29      16     01     VC(VS)
ENG      Fx   Rigid        21     17     20      25     02     VC(VS)
SCI      Fx   Rigid        23     04     49      29     03     VC(S)
CW       Fx   Cog Flex     26     08    -29      31     08     VC(VS)


* Acq scores for the high school responsibility nominee samples not available.


The multiple correlations for criterion prediction by v and s combined
were computed and contrasted with the original validities by
subtracting r_vc² from R_c.vs². The “Gain” column in Tables 3 and 4
shows this difference, the increase in variance accounted for by the
predictor variables when they are “corrected” for response set. Gain
values marked * and ** indicate R > r at the .05 and .01 significance
levels, respectively (McNemar, 1949, p. 266). The “Mode” column
is a classification of the manner of operation of the suppressor
paradigm or the basis of its failure as follows.
S   Significant gain, s functions as a suppressor: high correlation
    with v, essentially no correlation with c, β_s is negative.
NS  Significant gain, s is a negative suppressor: correlation with
    v opposite in sign to correlation with c.
P   Significant gain, s is a predictor (β_s positive).
VC  No significant gain, r_vc not significantly greater than zero.
VS  No significant gain, r_vs too low (less than .40).
SC  No significant gain although r_vc significant and r_vs ≥ .40:
    r_cs too large for suppression and too small for s to function
    as a predictor.5
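The Gain computation behind these categories can be sketched with the standard two-predictor formulas for the squared multiple correlation and the standardized weight of s. The function names are assumptions of this sketch; the first set of correlations is the SCI row for Re in Table 4 as printed, and the second set is a hypothetical pure suppressor with r_cs = 0 and r_vs = .40.

```python
import math


def multiple_r_squared(r_vc, r_cs, r_vs):
    """Squared multiple correlation of c with v and s combined."""
    return (r_vc**2 + r_cs**2 - 2.0 * r_vc * r_cs * r_vs) / (1.0 - r_vs**2)


def suppressor_beta(r_vc, r_cs, r_vs):
    """Standardized regression weight of s when c is predicted from v and s."""
    return (r_cs - r_vc * r_vs) / (1.0 - r_vs**2)


def gain(r_vc, r_cs, r_vs):
    """Increase in criterion variance accounted for when s is added to v."""
    return multiple_r_squared(r_vc, r_cs, r_vs) - r_vc**2


# Table 4, SCI sample, v = Re, c = Respon (values as printed).
sci_gain = gain(0.38, 0.37, 0.13)
sci_beta = suppressor_beta(0.38, 0.37, 0.13)

# Hypothetical pure suppressor: s unrelated to c, correlated .40 with v.
pure_ratio = math.sqrt(multiple_r_squared(0.30, 0.0, 0.40)) / 0.30
pure_beta = suppressor_beta(0.30, 0.0, 0.40)

print(round(sci_gain, 2), sci_beta > 0)
print(round(pure_ratio, 3), pure_beta < 0)
```

For the Table 4 row the weight of s is positive, which matches its classification as a predictor (P). In the hypothetical case the weight is negative and R exceeds r_vc by about 9 per cent, which is consistent with the footnote's description of r_vs = .40 as the point of roughly a 10 per cent gain.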


5 Since R is a function of three continuous variables, any categorization of
the "mode" of the suppressor paradigm is arbitrary. Choice of the criterion of
statistical significance of r_vc or R > r is particularly arbitrary when N's vary.
The VC category suffers, for instance, from the possibility of a significant
R > r even when r_vc fails of significance, and conversely a nonsignificant
R > r when r_vc is significant. Category VC was adopted to reflect the fact
that r_vc must be of a certain magnitude before any level of r_vs is effective in
producing a substantial gain. Similarly, the arbitrary level of r_vs = .40, the
point of a 10% gain, was adopted to reflect the need for relatively large values
of r_vs, even though a significant gain might be possible with a smaller value
if the original validity were great enough. In a majority of instances the mode
is clear-cut; the categories adopted are assigned to all instances to give a
general impression of the operation of the suppressor paradigm. Values of
r_vs < .40 have been noted secondarily in instances of category VC, and one
instance of satisfaction of the suppressor model which resulted in a nonsignificant
gain because of a small r_vc is noted (S).




Results 


Suppression of desirability resulted in significant predictive gain
in only 4 of 24 comparisons in the high-school data. In the non-high-school
data, only 2 of 36 comparisons show a significant suppression
effect, a result attributable to chance. There is no instance
in the grand total of 60 comparisons of a large gain in validity by
suppressing desirability. In almost half the instances in which R
significantly exceeds r_vc, s functions as a predictor. The expectation
that correcting personality scores for individual differences in desirability
responding will increase validity is not fulfilled. There
were no instances of significant gain in validity by suppression of
acquiescence variance.


Instances of Suppression 


The desirability scales are most consistently correlated with the
predictor scale Sc. The r_vs values invariably exceed .40. Suppression
of desirability in Sc accounts for four of the six instances of a suppression
effect. The failures are due to suppressor-criterion overlap
or invalidity of v. Of the CPI scales studied, Sc is the only one constructed
with reference to an “unfavorable” trait (Impulsivity), a
similarity to the MMPI clinical scales in which substantial desirability
variance has been reported.

Predictor Re might be expected to contain considerable desira- 
bility variance because of the highly evaluative nature of the trait. 


TABLE 5
Average Correlations of Desirability Measures with “Desirable” Criteria (a)
and “Desirable” Predictors (b)


                            Gi                SD
Sample        N (c)     r_cs   r_vs      r_cs   r_vs
HSM             6        00     22       [illegible]
HSF             6        11     36        16     52
All HS         12        06     29        14     39
MED             6        07     28        12     29
ENG             6        08     31        13     33
SCI             6        04     28        03     21
CW              3       -15    -01       -16     00
All Non-HS     21        03     25        06     24
Total          33        04     26        09     29


a Values of r_cs involving Impuls and Rigid reflected.
b Values of r_vs involving -Sc, -Fe, -Fx reflected.
c N = number of correlations on which average value based.


In the HSM sample, Gi functions as a suppressor for Re, but otherwise
the desirability scales either fail to account for enough predictor
variance or are criterion-related. An interesting paradox arises
in the case of Sp, a scale also a priori suspect of a substantial desirability
component. High desirability-responding high-school males
tend to score low on Sp, enough so that s functions as a negative
suppressor in one instance. High desirability-responding high-school
females tend to score high on both Sp and the criterion, s functioning
as a predictor of the criterion ratings. This suggests that the females
are more successful simulators in relation to the raters or that social
desirability tendencies are seen as an aspect of social presence in
females but of its absence in males.


The Suppressor-Criterion Relationship 


Where suppression does not fail because of insufficient suppressor-predictor
association (below), the model tends to be vitiated by low
but positive association of the suppressors and the criterion. The
average r_cs values in Table 5 are all positive except for the CW
sample. Suppressor SD, typically more correlated with the predictors
than is Gi, also correlates more with the criteria, which nullifies its
advantage. Instances of s predicting the criterion have been noted.
Association of s with c is not important in the case of the acquiescence
scale, with the exception of the one instance where s is a
predictor.


Predictor Variance Associated With Response Set 


The preponderance of failures in predictive gain stems from an
insufficient level of r_vs. Low validities play a role, but the typical
r_vs values allow only negligible proportional gains regardless of
validity. Table 5 shows the average r_vs values for the desirability
suppressors in each sample and for the total high-school and non-high-school
samples. Signs of predictors Sc, Fe, and Fx which were
reflected in Tables 3 and 4 were again reflected before computing the
average r’s, so that the correlations of “desirable” predictor scales
with the desirability measures are represented in all instances.

Somewhat more of the variance of the high-school females’ predictor
scales is associated with the desirability measures than for
the high-school males, and SD is somewhat more effective than Gi
in accounting for predictor variance. Only in the most favorable
case (HSF-SD) does the average r_vs exceed .40, and it is this case
that has the greatest suppressor-criterion relationship. For HSM,
all the other samples, and for all samples combined, less than 10%
of the predictor’s variance is associated, on the average, with the
desirability measures.
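The variance statement follows from squaring the average correlations. As a rough check, the sketch below squares three of the average r_vs values printed in Table 5; the dictionary labels are shorthand introduced here, and squaring an average r is only an approximation to the average shared variance.

```python
def variance_share(r):
    """Proportion of variance two measures share, given their correlation."""
    return r * r


# Average r_vs values as printed in Table 5.
avg_r_vs = {"HSF, SD": 0.52, "Total, Gi": 0.26, "Total, SD": 0.29}
shares = {label: variance_share(r) for label, r in avg_r_vs.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%} of predictor variance")
```

Only the HSF-SD case clears the 16 per cent share that corresponds to r_vs = .40; the totals for both suppressors stay under 10 per cent.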

These low desirability-predictor correlations and the results of an
independent study suggest social desirability is less important in
the CPI than might have been assumed from previous questionnaire
studies, most of which have been based on the MMPI. Fordyce
(1956) found an average correlation of .60 between Edwards' SD
scale and the MMPI clinical scales, but Goldberg, Rust, and Korn
(unpublished data, Stanford University Counseling Center, 1960)
found average correlations of only .20 (88 college males) and .34
(39 college females) between Wiggins’ (1959) MMPI social desirability
scale and the 15 CPI personality scales.

To investigate the role of desirability in the CPI further, item
desirability values (Messick & Jackson, 1961) based on judgments
of 83 male and 88 female collegians, sex samples pooled, were correlated
with CPI normative item endorsement frequencies (313 male


of the variance of CPI item endorsement frequency is associated
with mean item desirability level. The stability afforded both the
desirability and communality variables by the large numbers of
judges and respondents lends confidence to the obtained correlation


of psychopathological content (low desirability-low endorsement)
than non-MMPI items.

With respect to acquiescence, even the most imbalanced predictor
keys (Sc and Fx) are not frequently enough associated with the
acquiescence scale to indicate that acquiescence variance is a predictive
hazard.


Discussion


Like good men, good suppressor variables are hard to find. In
addition to the studies indicating ineffectiveness or minimal effectiveness
of the MMPI K correction and of acquiescence keys, unpublished
studies by Harrison Gough (personal communication)
and by the Personnel Research Branch, Adjutant General’s Office
(personal communication from Jack Sawyer) which failed to find
means of improving validity by suppression of test-taking attitudes
may be cited. Although the response set measures used in suppressor
studies have often been quite carefully constructed, are
reliable and capable of reflecting a range of individual differences,
and are effective in detecting extremes of test-taking attitudes
(faking), they do not ordinarily account for enough criterion-irrelevant
predictor variance to warrant their use in correcting the
scores of typical assessment subjects. Two different rationales were
employed in constructing the desirability measures used in the
present study; neither is effective. It seems unlikely that further
refinements of scaling the desirability variable will be fruitful in
this context. Norman (1961) points out that until predictors can
be developed that account for more criterion variance than is usually
the case, the absolute predictive gain from suppression where
r_vs is at or about .40 is negligible, and suggests that attention
should be given to discovering independent primary predictors
rather than suppressor variables.

The criterion is probably the most questionable element in the
evidence pointing to the futility of attempting to control response
set variance. Criterion measures with more construct validity than
ratings or psychiatric or socially-defined classifications might be
best predicted by personality measures freed from social desirability.
While composite behavior ratings such as those obtained at
IPAR are probably about as valid as is possible with respect to
such factors as rater training, spectrum of behavior observed, reliability,
and minimization of rater bias, ratings may not be a
suitable basis for evaluating either test validity or the response set
problem. The best ratings may show contamination by the social
desirability factor that a better criterion would not.

The significance of this for the present data is limited, however, 
by the fact that low r_ps values rather than high r_sc values are the
most frequent source of difficulty.

Some degree of desirability “bias” on the part of the subject may 
be a relevant aspect of personality, perceived and (properly) included
in the criterion by raters. Above-average self-esteem or
absence of psychopathology may result in above-average social desirability
responding, as may actual possession of more than the average
amount of desirable personality traits. The positive relationship
of MMPI K to such variables as class status (Dahlstrom & Welsh,
1960), improvement in psychotherapy (Barron & Leary, 1955), and
pilot adjustment (Fulkerson et al., 1958) may indicate greater
defensiveness in groups with more favorable status but is also
interpretable as indicating that moderately elevated K scores reflect
good personality integration. Anxiety scales such as Taylor's MAS
or Welsh's A (Welsh, 1956) are substantially (negatively) related
to social desirability and yet appear to have considerable construct
validity. Schultz (1962) contends that factors which best account
for variance in groups of items or tests (factors frequently interpreted
as social desirability or acquiescence) may reflect a combination
of content and response sets. He suggests, “Edwards’ SD
scale accounts for much of the variance of other MMPI scales since
its keying is similar to those scales in terms of Social Desirability, 
Acquiescence, and content” (Schultz, 1962, p. 34). 

The available evidence supports Gough's position that social 
desirability variance need not and perhaps should not be removed
from scales designed to measure personality traits of the type
studied here. The kind of interpretation placed on the social desirability
variable (Edwards, 1953, 1957b, 1962; Edwards & Walker,
1961) which Wiggins (1962) termed “sinister” does not seem justified.
Wiggins (1962, p. 226) notes the logical similarity of the desirability
and communality concepts, pointing out that it is not
surprising “. . . that the majority of normals will endorse what is
considered to be the acceptable response by the majority of normals.”
A desirability-communality correlation such as that observed
for CPI items leaves considerable latitude for deviation of the
group item endorsement level from that predictable from item desirability.
An individual has even more latitude for deviating from
the desirable response on an appreciable number of items, since
individuals would deviate from the communal response even if
desirability and communality were perfectly correlated in group
data. Even though the probability of a desirable response is high
(e.g., .81, Edwards, 1962), the number of nondesirable responses
which will typically occur (91 in a 480 item inventory) is sufficient
to elevate or depress several personality scores quite distinctively.
Several studies (Heilbrun & Goodstein, 1961; Rosen & Mink, 1961;
Taylor, 1959, 1961) document the willingness of at least some Ss
to make self-descriptive responses which run counter to their own
desirability judgments. CPI Re is made up of items for which the
keyed response is almost invariably “desirable,” yet individuals
differ in the number of items endorsed and these differences predict
non-test behaviors. In the case of “subtle-zero” items (Meehl, 1945)
the desirable response is the significant one: deviations from the
norm in terms of hyper-desirability or hyper-communality are in
this instance diagnostic. 
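
The figure of 91 nondesirable responses in a 480 item inventory follows directly from the quoted probability of .81. A minimal sketch of the arithmetic is given below; the function name is hypothetical, and the standard deviation it reports rests on the added assumption (not made in the text) that items are answered independently with the same probability.

```python
import math


def expected_nondesirable(n_items: int, p_desirable: float):
    """Expected count of nondesirable responses and its binomial SD."""
    q = 1.0 - p_desirable
    mean = n_items * q
    # Binomial spread: assumes independent items with equal probability.
    sd = math.sqrt(n_items * p_desirable * q)
    return mean, sd


mean_count, sd_count = expected_nondesirable(480, 0.81)
print(round(mean_count), round(sd_count, 1))
```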

It seems unremarkable that mean personality scale scores can 
be predicted by considerations of desirability (Edwards, 1962). 
Mean scores are in fact used to reflect response communalities in 
establishing norms for gauging an individual's distinctiveness rela- 
tive to the normative group. The present data do not indicate a 
high degree of predictability of individuals’ scores on the basis of 
social desirability. The high individual predictability implied by
Edwards and Walker’s (1961) study appears equivocal because of
differences in the lengths of the scales across which an individual’s
desirability-personality score correlation was computed.

The present data indicate that accounting for individual differences
in acquiescence is not worthwhile. This finding is consistent with
item reversal studies of the Taylor MAS (Chapman & Campbell,
1959) and the MMPI (DeSoto & Kuethe, 1959; Dicken & Van
Pelt, unpublished data) which suggest acquiescence is of relatively
little importance in determining endorsement of items which make
specific personal references. Factor analyses indicating substantial
acquiescence variance in personality scales may be at least in part
a result of content differences in items worded in a positive or
negative fashion. Schultz (1962) found only a very small acquiescence
factor when item content and keying direction were systematically
counterbalanced.


Summary 


Measures of good impression, social desirability, and acquiescence
were used as suppressor variables with nine personality scales
of the California Psychological Inventory. Nine corresponding behavior
ratings were used as criteria in one or more of six independent
samples. Significant gains in validity by accounting for
good impression and desirability were rare. No gain in validity resulted
from suppressing acquiescence. Existing methods of correcting
for response set variance in personality scales do not appear
pragmatic. The importance of social desirability and acquiescence 
in questionnaire personality assessment may have been overem- 
phasized. 


REFERENCES 


Barron, F. and Leary, T. “Changes in Psychoneurotic Patients with 
and without Psychotherapy.” Journal of Consulting Psychology, 
XIX (1955), 239-245. 

Berg, I. “Response Bias and Personality: The Deviation Hypothesis.”
Journal of Psychology, XL (1955), 61-72.

Chapman, L. and Campbell, D. “Absence of Acquiescence Response
Set in the Taylor Manifest Anxiety Scale.” Journal of Consulting
Psychology, XXIII (1959), 465-466.

Couch, A. and Keniston, K. “Yeasayers and Naysayers: Agreeing
Response Set as a Personality Variable.” Journal of Abnormal
and Social Psychology, LX (1960), 151-174.

Dahlstrom, W. and Welsh, G. An MMPI Handbook. Minneapolis:
University of Minnesota Press, 1960.

Edwards, A. Manual for the Edwards Personal Preference Schedule.
(Rev.) New York: Psychological Corporation, 1957. (a)

Edwards, A. The Social Desirability Variable in Personality Assessment
and Research. New York: Dryden Press, 1957. (b)

Edwards, A. “Social Desirability and Expected Means on MMPI
Scales.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, XXII
(1962), 71-76.

Edwards, A. and Walker, J. “A Short Form of the MMPI: The SD
Scale.” Psychological Reports, VIII (1961), 485-486.

Fordyce, W. “Social Desirability in the MMPI.” Journal of Consulting
Psychology, XX (1956), 171-175.

Fricke, B. “Conversion Hysterics and the MMPI.” Journal of Clinical
Psychology, XII (1956), 322-326. (a)

Fricke, B. “Response Set as a Suppressor Variable in the OAIS and
the MMPI.” Journal of Consulting Psychology, XX (1956),


Fulkerson, S. “An Acquiescence Key for the MMPI.” Report No.
58-71. School of Aviation Medicine USAF, Randolph AFB,
Texas, 1958.

Fulkerson, S., Freud, S., and Raynor, G. “The Use of the MMPI in
the Psychological Evaluation of Pilots.” Journal of Aviation
Medicine, XXIX (1958), 122-129.

Gough, H. “Simulated Patterns on the MMPI.” Journal of Abnormal
and Social Psychology, XLII (1947), 215-225.

Gough, H. “On Making a Good Impression.” Journal of Educational
Research, XLVI (1952), 33-42.

Gough, H. Manual for the California Psychological Inventory. Palo
Alto: Consulting Psychologists Press, 1957.

Hanley, C. “Social Desirability and Responses to Items from Three
MMPI Scales: D, Sc, and K.” Journal of Applied Psychology,
XL (1956), 324-328.

Hanley, C. “Deriving a Measure of Test-Taking Defensiveness.”
Journal of Consulting Psychology, XXI (1957), 391-397.

Heilbrun, A. and Goodstein, L. “Consistency Between Social Desirability
Ratings and Item Endorsement as a Function of Psychopathology.”
Psychological Reports, VIII (1961), 69-70.

Hunt, H., Carp, A., Cass, W., Winder, C., and Kantor, R. “A Study
of the Differential Diagnostic Efficiency of the MMPI.” Journal
of Consulting Psychology, XII (1948), 331-336.

Jackson, D. “Stylistic Response Determinants in the California Psychological
Inventory.” EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, XX (1960), 339-346.

Jackson, D. and Messick, S. “Content and Style in Personality Assessment.”
Psychological Bulletin, LV (1958), 243-252.

Jackson, D. and Messick, S. “Acquiescence and Desirability as Response
Determinants on the MMPI.” EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, XXI (1961), 771-790.

Klett, C. and Yaukey, D. “A Cross-Cultural Comparison of Judgements
of Social Desirability.” Journal of Social Psychology,
XLIX (1959), 19-26.

Lubin, A. “Some Formulae for Use With Suppressor Variables.”
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, XVII (1957),
286-296.

MacKinnon, D., Crutchfield, R., Barron, F., Block, J., Gough, H.,
and Harris, R. “An Assessment Study of Air Force Officers.”
Technical Report WADC-TR-58-91. Lackland AFB, Texas,
1958.

McKinley, J., Hathaway, S., and Meehl, P. “The MMPI: VI. ‘The
K Scale’.” Journal of Consulting Psychology, XII (1948), 20-31.

McNemar, Q. “The Mode of Operation of Suppressant Variables.”
American Journal of Psychology, LVIII (1945), 544-555.

McNemar, Q. Psychological Statistics. New York: John Wiley &
Sons, 1949.

Meehl, P. “The Dynamics of Structured Personality Tests.” Journal
of Clinical Psychology, I (1945), 296-303.

Meehl, P. and Hathaway, S. “The K Factor as a Suppressor Variable
in the MMPI.” Journal of Applied Psychology, XXX
(1946), 525-564.

Messick, S. and Jackson, D. “Desirability Scale Values and Dispersions
for MMPI Items.” Psychological Reports, VIII (1961),
409-414.

Monachesi, E. “Personality Patterns of Juvenile Delinquents as Indicated
by the MMPI.” In Hathaway, S. and Monachesi, E.
(Eds.) Analyzing and Predicting Juvenile Delinquency with the
MMPI. Minneapolis: University of Minnesota Press, 1953.

Norman, W. “Problems of Response Contamination in Personality
Assessment.” ASD-TN-61-43. Personnel Laboratory, Lackland
AFB, Texas, 1961.

Peters, C. and Van Voorhis, W. Statistical Procedures and Their
Mathematical Bases. New York: McGraw-Hill, 1940.

Rosen, E. and Mink, Shirley. “Desirability of Personality Traits as
Perceived by Prisoners.” Journal of Clinical Psychology, XVII
(1961), 147-151.

Schmidt, H. “Notes on the MMPI: The K Factor.” Journal of Consulting
Psychology, XII (1948), 337-342.

Schultz, C. “Response Set Factors Revealed by Factor Analysis of
an Unconfounded Item Pool.” Mimeographed: ONR Contract
477 (33), University of Washington, Seattle, May, 1962.

Taylor, J. B. “Social Desirability and MMPI Performance: The
Individual Case.” Journal of Consulting Psychology, XXIII
(1959), 514-517.

Taylor, J. B. “What Do Attitude Scales Measure: The Problem of
Social Desirability.” Journal of Abnormal and Social Psychology,
LXII (1961), 386-390.

Tyler, F. and Michaelis, J. “K-Scores Applied to MMPI Scales for
College Women.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT,
XIII (1953), 459-466.

Welsh, G. “Factor Dimensions A and R.” In Welsh, G. and Dahlstrom,
W. (Eds.) Basic Readings on the MMPI in Psychology
and Medicine. Minneapolis: University of Minnesota Press,
1956.

Wiggins, J. “Interrelationships Among MMPI Measures of Dissimulation
under Standard and Social Desirability Instructions.”
Journal of Consulting Psychology, XXIII (1959), 419-427.

Wiggins, J. “Strategic, Method, and Stylistic Variance in the
MMPI.” Psychological Bulletin, LIX (1962), 224-242.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
Vol. XXIII, No. 4, 1963


ANOTHER ATTEMPT AT CONFIGURAL SCORING¹
DAVID CAMPBELL


Student Counseling Bureau 
University of Minnesota 


This is a report of an attempt to develop a method of configural
scoring for the Minnesota Vocational Interest Inventory (MVII),
an interest inventory developed by K. E. Clark for use with non-professional
men.² In earlier studies Clark (1961) has shown that
this inventory has considerable validity using conventional item
scoring methods, and a later study indicated that the use of response
patterns could be helpful in selecting the most powerful items
for use on item scoring scales (Campbell, 1962). The next logical
step was to score those patterns instead of individual items, and the
results of that step are reported here.

Pattern scoring has a certain seductive air about it. Intuitively, it
seems very clear that certain configurations of responses must be
more meaningful than those same responses taken individually.
(This intuition has been supported mathematically by Meehl
(1950), Horst (1954), Lykken (1956), and Lubin and Osburn
(1957).) If these specific configurations could be identified in some
manner, they should help considerably in separating criterion groups
from men-in-general groups, various types of individuals from other
types, successful people from unsuccessful ones, etc. Psychological
tests and inventories could be strengthened immensely by utilizing
the relatively unique configurations of responses given by individuals.
An earlier article by Gaier and Lee (1953) discusses the advantages
of the configural approach at greater length.

¹ Financial support for this project came from the Graduate School, and
computer time from the Numerical Analysis Center, at the University of
Minnesota.

² I would like to acknowledge the early stimulation for this project provided
by K. E. Clark while I was one of his Ph.D. candidates.

In assessing any configural scoring method, the investigator is, 
of course, bound by the same criteria as those used to evaluate 
traditional methods, i.e., he must show that his approach is valid 
and reliable, and there should be particular emphasis on cross-validation.
Configural scoring methods usually select a few patterns
out of a large possible number, and there is the danger that the
observed frequency of any one pattern is too unstable to hold up
under cross-validation. 

Besides these criteria, the investigator must also show that his 
approach is an improvement over conventional item scoring methods.
Configural scoring involves a great deal of tedious clerical
work, both in the standardization proceedings and in the routine
scoring of the individual’s responses. Because of this, it is essential
that any proposed pattern scoring method offer some advantage over
current methods. Any report which ignores this aspect leaves a very
crucial gap in the knowledge of the new method. 

Viewed with the above in mind, i.e., cross-validity, reliability, and
comparison with routine techniques, it is clear that configural approaches
have not yet succeeded. Apparently without exception,
attempts to date have not survived the rigorous test of cross-validation.
Several studies show a rather severe cross-validation
shrinkage (McQuitty, 1954, 1957a; Forehand & McQuitty, 1959;
Lee, 1956), while others fail to report any attempt at cross-validation
(McQuitty, 1957b, 1960a, 1960b, 1960c, 1961a, 1961b, 1962,
1963a, 1963b). 

When configural scoring approaches are compared with the more 
traditional methods such as multiple regression, etc., the results are 
invariably the same: configural scoring does better in the validation 
sample, but in the cross-validation sample the traditional methods 
hold up better (Forehand & McQuitty, 1959; Lee, 1956; McQuitty, 
1957a, 1957b).

None of the above studies report any reliability data. 

Cross-validation has been the big hurdle. One potential explana- 
tion for this was mentioned above, i.e., the selection of a few patterns 
from the large number possible. Under these circumstances the fre- 
quencies of any one pattern may be too unstable to use in deciding 
which patterns to score. To overcome this, either a very large number
of individuals must be used in the standardization group, or a
method must be found to limit the number of possible patterns.

The current study attempted the latter approach, using as patterns
the rank orderings given by individuals to the triads on the
MVII. In this way, the number of possible patterns for each triad
was held to six, a relatively easy number to manage. As noted in
more detail below, this method was not very successful in preventing
cross-validation shrinkage.


Method 


The raw data used were some of those collected by Clark in the
original standardization of the MVII. Three occupational groups
were used: painters, electricians, and IBM operators, each split into
validation and cross-validation groups. These groups were compared
with Clark’s tradesmen-in-general.

On the MVII, an individual is faced with triads of three activities
and asked to choose in each triad the one liked best and the one
liked least. Assuming that the remaining statement lies somewhere
between the other two, essentially he has rank ordered the activities
by giving only two responses.
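
The way two responses determine a full rank ordering can be sketched as follows. This is an illustration only: the activities in a triad are labeled 0, 1, and 2, and the function and variable names are hypothetical.

```python
from itertools import permutations


def rank_order(best: int, least: int) -> tuple:
    """Return (best, middle, least) for a triad whose activities are 0, 1, 2."""
    if best == least or not {best, least} <= {0, 1, 2}:
        raise ValueError("best and least must be two different items of the triad")
    # The unchosen activity is assumed to fall between the two chosen ones.
    middle = ({0, 1, 2} - {best, least}).pop()
    return (best, middle, least)


# Every possible ordering of three activities: 3! = 6 patterns per triad.
ALL_PATTERNS = sorted(permutations(range(3)))
PATTERN_INDEX = {pattern: i for i, pattern in enumerate(ALL_PATTERNS)}

print(len(ALL_PATTERNS), rank_order(best=2, least=0))
```

Each (best, least) pair maps to exactly one of the six orderings, so a respondent's answers to a triad can be stored as a single pattern index.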

In this study, each of the six possible rank orderings of the triad
was used as a pattern, and scales were developed using these patterns
as the individual units of the scale in the same manner that
the usual scoring methods use individual items.

The per cent response to each of the possible rank orderings of
each triad (the MVII has 190 triads) was tabulated for the validation
groups and the tradesmen-in-general group, and a series of
scales was developed for each occupation. The first scale contained
the patterns showing a 25 per cent or more difference between the
criterion and tradesmen-in-general groups; the next scale included
the initial patterns and added those showing a 24 per cent difference;
the next scale included the 23 per cent items, etc. The series
continued until the scale contained up to 100 patterns, or until the
[illegible] per cent level was reached, whichever came first. Table 1 shows
the number of patterns scored at each level for each occupational
group.
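
The construction of a nested series of scales can be sketched as below. This is a hedged illustration rather than the procedure actually programmed for the study: the function names are hypothetical, the lowest cutoff (`floor`) is a placeholder because that figure is not legible in the source, and the use of signed unit weights (+1 when the criterion group gives the pattern more often, -1 when less often) is an assumption.

```python
def build_scale_series(criterion_pct, reference_pct, start=25, floor=10,
                       max_patterns=100):
    """Return {cutoff: {pattern_key: weight}} for successively lower cutoffs.

    criterion_pct and reference_pct map (triad, rank_order) keys to the per
    cent of each group giving that rank ordering. `floor` is a hypothetical
    lowest cutoff; signed unit weights are an assumption of this sketch.
    """
    diffs = {key: pct - reference_pct.get(key, 0.0)
             for key, pct in criterion_pct.items()}
    series = {}
    for cutoff in range(start, floor - 1, -1):
        scale = {key: (1 if d > 0 else -1)
                 for key, d in diffs.items() if abs(d) >= cutoff}
        if len(scale) > max_patterns:
            break  # stop before a scale grows past the pattern limit
        series[cutoff] = scale
    return series


def score(responses, scale):
    """Sum the weights of the rank orderings one person actually gave."""
    return sum(scale.get(key, 0) for key in responses)


# Toy data: per cent of each group giving a particular rank ordering.
criterion = {("triad1", "ABC"): 60.0, ("triad1", "CBA"): 10.0,
             ("triad2", "BAC"): 40.0}
reference = {("triad1", "ABC"): 30.0, ("triad1", "CBA"): 34.0,
             ("triad2", "BAC"): 38.0}
series = build_scale_series(criterion, reference, start=25, floor=23)
person = [("triad1", "ABC"), ("triad2", "BAC")]
print({cutoff: score(person, scale) for cutoff, scale in series.items()})
```

Because each scale is built from the same table of differences, every scale in the series contains all patterns of the scale before it, as in the description above.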


[Table 1, giving the number of patterns scored at each per cent level, the
per cent overlap for the validation (VAL) and cross-validation (X-VAL)
groups alongside Clark's item scoring results, and the test-retest
reliabilities of the IBM operator scales, is not recoverable from this copy.]

After the scales were established, the validation (VAL), cross-
validation (X-VAL), and the tradesmen-in-general groups were
scored, and all of the criterion groups were compared with the
tradesmen-in-general group, using the Tilton measure of per cent
overlap. This statistic gives an indication of how many scores in 
one distribution can be matched by scores in another distribution. 
(Note that the usual t-test between means is not particularly useful 
here; all of the occupational groups are highly statistically sig- 
nificantly different from the tradesmen-in-general group. The per 
cent overlap gives a better indication here of the magnitude of the 
practical difference.) 
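
A per cent overlap of this kind is commonly computed from the difference between the two means expressed in units of the average standard deviation, assuming roughly normal score distributions. The sketch below follows that normal-theory form; it is an assumption that this matches the tables used in the study, and the function name and sample scores are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def percent_overlap(group_a, group_b) -> float:
    """Normal-theory per cent overlap between two score distributions."""
    avg_sd = (stdev(group_a) + stdev(group_b)) / 2
    # Separation of the means in units of the average standard deviation.
    d = abs(mean(group_a) - mean(group_b)) / avg_sd
    # Two identical distributions overlap 100 per cent (d = 0).
    return 200 * NormalDist().cdf(-d / 2)


criterion_scores = [3, 4, 5, 6, 7]   # made-up scale scores
reference_scores = [1, 2, 3, 4, 5]   # made-up scale scores
print(round(percent_overlap(criterion_scores, reference_scores), 1))
```

Lower overlap means better separation of the occupational group from the reference group, so shrinkage on cross-validation appears as a rise in overlap.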

The test-retest reliability of each of the scales in the IBM Opera- 
tor series was obtained by scoring a group of trade school students 
who took the MVII twice within a 30-day period. 


Results and Discussion 


The results of the comparisons are reported along with Clark’s 
results in Table 1. It is obvious that pattern scoring did not do 
as well as item scoring, and, as in other studies, the cross-validation 
shrinkage of the pattern scoring scales was severe. This type of 
scoring offers no improvement over item scoring of the MVII;
indeed, it would be detrimental. 

The test-retest reliabilities of the IBM operator scales are reported
in the right-hand column of Table 1. These reliabilities
supply some relevant data on the question of the stability of the
response patterns used in this scoring method. If the cross-validation
shrinkage is caused by the instability of the patterns, then
these test-retest reliabilities should be lower than those of item
scoring scales. Surprisingly, these pattern scoring reliabilities are of
the same general magnitude as those reported by Clark for the
MVII item scoring scales. This seems to indicate that patterns are
as stable over time as the individual item responses.

Even with this one somewhat positive finding, the results of this
study should be discouraging to proponents of configural scoring.
Although configural scoring remains seductive, when the results of
this study are added to the generally negative ones already reported,
it is clear that any fulfillment through this method remains
in the future.


Summary 


Three occupational scales were developed for the Minnesota Vocational
Interest Inventory using pattern scoring methods. The patterns
used were the rank orderings of the statements within each of
the triads of the MVII. 

The pattern scoring results were compared with Clark’s results 
on the original standardization of the MVII. The conclusions were: 

1. Clark’s item scoring procedures are more valid than these pat- 
tern scoring procedures. 

2. The pattern scoring methods showed considerable cross-valida- 
tion shrinkage. 

3. The pattern scoring scales are as reliable as the item scoring 
scales. 


REFERENCES 


Campbell, D. P. “The Use of Response Patterns to Improve Item
Scoring.” Journal of Applied Psychology, XLVI (1962), 194-197.

Clark, K. E. Vocational Interests of Non-Professional Men. Min- 
neapolis: University of Minnesota Press, 1961. 

Forehand, G. A., Jr. and McQuitty, L. L. “Configurations of Factor
Standings as Predictors of Educational Achievement.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, XIX (1959), 31-43.

Gaier, E. L. and Lee, Marilyn C. “Pattern Analysis: The Configural
Approach to Predictive Measurement.” Psychological Bulletin, L
(1953), 141-149.

Horst, P. “Pattern Analysis and Configural Scoring.” Journal of
Clinical Psychology, X (1954), 3-11.

Lee, Marilyn C. “Configural vs. Linear Prediction of Collegiate
Academic Performance.” Unpublished Ph.D. thesis, University
[illegible], 1956.

Lubin, A. and Osburn, H. G. “A Theory of Pattern Analysis for the
Prediction of Qualitative Criterion.” Psychometrika, XXII
(1957), 63-73.

Lykken, D. T. “A Method of Actuarial Pattern Analysis.” Psychological
Bulletin, LIII (1956), 102-107.

McQuitty, L. L. “Pattern Analysis Illustrated in Classifying Patients
and Normals.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT,
XIV (1954), 598-604.

McQuitty, L. L. “Isolating Predictor Patterns Associated with Major
Criterion Patterns.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT,
XVII (1957), 3-42. (a)

McQuitty, L. L. “Elementary Linkage Analysis for Isolating Orthogonal
and Oblique Types and Typal Relevancies.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, XVII (1957), 207-229. (b)

McQuitty, L. L. “Hierarchical Linkage Analysis for the Isolation of
Types.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, XX
(1960), 55-67. (a)

McQuitty, L. L. “Hierarchical Syndrome Analysis.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, XX (1960), 293-304. (b)


McQuitty, L. L. “Comprehensive Hierarchical Analysis.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, XX (1960), 805-816. (c)

McQuitty, L. L. “A Method for Selecting Patterns to Differentiate
Categories of People.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT,
XXI (1961), 85-94. (a)

McQuitty, L. L. “Typal Analysis.” EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, XXI (1961), 677-695. (b)

McQuitty, L. L. “Multiple Hierarchical Classification of Institutions
and Persons with Reference to Union-Management Relations
and Psychological Well-Being.” EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, XXII (1962), 513-531.

McQuitty, L. L. “Rank Order Typal Analysis.” EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT, XXIII (1963), 55-61. (a)

McQuitty, L. L. “Best Classifying Every Individual at Every
Level.” EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, XXIII
(1963), 337-345. (b)

Meehl, P. E. “Configural Scoring.” Journal of Consulting Psychology,
XIV (1950), 165-171.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 4, 1963


THE EFFECT OF CORRELATION ON THE REPEATED
MEASURES DESIGN¹


ROBERT E. LANA 
Alfred University 
AND 


ARDIE LUBIN 
Walter Reed Army Institute of Research 


The focus of this report is on the use of Analysis of Variance
(Anova) as an inferential technique in psychological experiments 
when repeated measurements or multiple criteria are used. Many 
Anova techniques have been taken directly from agricultural re- 
search. However, psychological experiments with live animals in- 
volve difficulties which are not present when the basic experimental 
unit is a plot of earth. In particular, taking multiple measurements 
of the same subject and/or subjecting him to multiple treatments 
will, in general, lead to departures from the assumptions of statisti- 
cal independence of residuals and additivity of treatment effects. 

Test carry-over effects, treatment carry-over effects, interaction 
between test carry-over and treatment effects, etc., are important
sources of bias, but are too complex to be dealt with here, except in
a peripheral way. The problem of correlated observations is relatively
simple. In general, the correlations can always be calculated
and Hotelling’s exact multivariate Anova applied. In many cases
even this is not necessary, since the direction of the bias in univariate
Anova is always toward overestimation of the significance of
the F ratio. Upper and lower limits of the true significance level can
readily be estimated.

¹ This research was supported by the National Institute of Mental Health,
United States Public Health Service Grant M-4113 (A) to Alfred University,
on which the first author served as Principal Investigator.


A study is classified as a repeated measures design whenever the
total d.f. in a univariate Anova is greater than N, the number of
subjects. This implies that some subject must have been measured
more than once. In the typical repeated measures experiment, there
is a score matrix of N subjects by k trials which is treated as a two-way
Anova. Usually the deviance about the grand mean (i.e., the
sum of the squares) is divided into Between S's with N-1 d.f., Between
Trials with k-1 d.f., and a Subject-by-Trial interaction with
(N-1)(k-1) d.f.

“Multiple criteria” designs are those where a subject receives several
scores for each test. For example, in a multiple choice reaction
time task, the subject might be scored on accuracy and speed. In
this article, we will treat “multiple criteria” designs as if they were
repeated measures without the element of order found in successive
measurements on the same subject.

In order to discover the frequency of repeated measures designs
and how they were analyzed in psychology, three journals (Journal
of Experimental Psychology, Journal of Comparative and Physiological
Psychology, and Journal of Abnormal and Social Psychology)
were examined for the years 1957 through 1959. About half of
the articles (593) used Anova. Almost 40 per cent of these Anova
articles used a repeated measures design. Another 30 per cent used
multiple criteria. In other words, about one-third of all published
articles used designs with correlated observations.

Usually a univariate Anova procedure was followed in analyzing
the data. Two-way Anova was often applied to the subjects-by-trials
score matrix. This procedure is recommended by all of the
standard statistical psychology texts; e.g., Edwards (1960), Lindquist
(1953), and McNemar (1962). Multiple criteria were usually
analyzed by applying Anova to each variable separately. In only
one study was the possibility of unequally correlated observations
taken into account.

Since repeated measures designs are the most frequently used experimental
designs and are almost always analyzed incorrectly, the
basis for these incorrect analyses must be examined. The difficulty
occurs because a univariate Anova is applied to repeated measurements.
One of the basic assumptions of the Anova is that, for every
group of subjects, the observations have zero or equal correlations.
In general, this will not be true for repeated measurements.




The purpose of this paper is: 1) to point out the necessity of 
utilizing multivariate rather than univariate Anova when repeated 
measurements are taken on the same organism; and 2) to sum- 
marize Box's (1954) findings and the discoveries of others on the 
effects of utilizing univariate Anova for repeated measurements. 

Difficulties in the usual analysis of the repeated measures design 
can be examined through use of a hypothetical psychopharmacological
experiment. Let us assume that an experimenter believes that
slow reaction times are characteristic of paranoid schizophrenics,
and he thinks that this symptom can be alleviated by chronic administration
of some tranquillizing drug. He then selects a sample
of N paranoid schizophrenics, puts each patient on a maintenance
dose and starts testing reaction time once a week. At the end of k
weeks, the reaction time scores can be arranged as a rectangle, N
rows by k columns. The statistical analysis indicated by such texts
as Edwards (1950), Lindquist (1953), and McNemar (1955), would
be a two-way analysis of variance, with (N—1) (k—1) degrees of 
freedom for the subject-by-week interaction effect. The significance 
of the differences between the k weekly means would be assessed by 
an F ratio using the subject-by-week interaction as the error term. 
Let us call this ratio the “univariate F.” 
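
The univariate F just described can be computed from the N by k score rectangle as in the following sketch. The function name and the small data set are hypothetical; the scores are made up for illustration and are not reaction times from any study.

```python
def univariate_f(scores):
    """Two-way (subjects x weeks) Anova with one score per cell.

    scores: list of N rows (subjects), each holding k weekly scores.
    Returns the sums of squares, degrees of freedom, and the F ratio for
    weeks tested against the subject-by-week interaction.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    subject_means = [sum(row) / k for row in scores]
    week_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_weeks = n * sum((m - grand) ** 2 for m in week_means)
    ss_interaction = ss_total - ss_subjects - ss_weeks  # error term

    df_weeks, df_interaction = k - 1, (n - 1) * (k - 1)
    f_ratio = (ss_weeks / df_weeks) / (ss_interaction / df_interaction)
    return {"ss": (ss_subjects, ss_weeks, ss_interaction),
            "df": (df_weeks, df_interaction), "F": f_ratio}


example = [[1, 2, 3], [2, 4, 6], [3, 4, 5], [2, 2, 2]]  # N = 4, k = 3
result = univariate_f(example)
print(result["df"], round(result["F"], 3))
```

The F ratio is referred to the F distribution with (k-1) and (N-1)(k-1) degrees of freedom; it is this reference distribution that becomes too lenient when the weekly scores are unequally correlated.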

One of the basic assumptions for the use of subject-by-week inter- 
action as the error term, is that all observed scores have equal 
correlations. However, with repeated measures it is usual to find
that two successive trials have high positive correlations, but trials
separated in time have lower correlations. Kogan (1948)
discussed some of the difficulties of univariate Anova and suggested 
that positive inter-correlations would yield a positive bias, i.e., that 
the univariate F ratio overestimates the significance of the differ- 
ences between k correlated means. 


Multivariate Analysis of Variance 


Exact multivariate Anova procedures covering the situation of 
correlated means were devised by Hotelling, Fisher, and Wilks dur- 
ing the decade 1930 through 1939. The exact treatment of correlated 
measures with a multivariate normal distribution (mutual linearity,
marginal normality, and homoscedasticity) was first given by Hotelling
in 1931. Recently, textbooks devoted almost entirely to
multivariate Anova have been written by T. W. Anderson (1958)
and M. G. Kendall (1956). However, there is only one published
text in psychological statistics that presents the appropriate multi- 
variate test for differences among correlated means (Winer, 1962). 

G. E. P. Box (1954) devised an ingenious method for
assessing the approximate effect of unequal correlations and vari- 
ances on the univariate F ratio. He found that Kogan’s conjecture 
was correct. The univariate F ratio has a positive bias, yielding sig- 
nificant results too often. Roughly speaking, the effect of unequal 
correlations between the trials is to reduce the apparent number of 
degrees of freedom in the numerator and denominator of the F ratio. 

Box's model, and the conclusions he drew, are worth sketching 
here, since they demonstrate why multivariate Anova, rather than 
univariate Anova, is most generally appropriate for correlated ob- 
servations. Box makes two assumptions: 


a) the vector of scores for any subject is statistically independent 
of the score vector for any other subject, under the null hypothesis; 


b) each vector is a sample from the same multivariate normal 
population. 


In terms of our hypothetical psychopharmacological study, this 
means that the N paranoids are randomly selected and the relation 
between the scores of any two weeks, say week s and week t, is
bivariate normal. The variance of week t, $v_{tt}$, need not equal $v_{ss}$;
$r_{ts}$ does not necessarily equal the correlation between any other pair
of weeks.

C. R. Rao (1952, pp. 239-244) showed how Hotelling's T² could
be adapted to give an exact test of the differences between correlated
means. Basically, Rao takes a linear function of the k scores and
compares the mean of this linear function to the variance of the
linear function. [A convenient computation routine for this test is
given by T. W. Anderson in his text (1958, par. 5.3.5).]

Using the exact multivariate approach, Box shows that under the 
null hypothesis the true distribution of the univariate F with (k—1) 
over (k—1) (N—1) d.f. can be approximately represented by the 
same F value with the degrees of freedom reduced by a fraction.
This fraction, epsilon, is a function of the k by k covariance matrix,

$$\epsilon = \frac{k^2(\bar{v}_{tt} - \bar{v}_{..})^2}{(k-1)\left[\sum_t \sum_s v_{ts}^2 - 2k\sum_t \bar{v}_{t.}^2 + k^2\bar{v}_{..}^2\right]},$$

where $v_{ts}$ is the covariance of the N pairs of scores from week t and
week s, $\bar{v}_{t.}$ is the average covariance of week t with the k
weeks (including week t itself), $\bar{v}_{tt}$ is the average variance for
the k weeks, and $\bar{v}_{..}$ is the average of all k² variances and covariances.

The maximum value of epsilon is one, and this is reached only 
when the k variances are equal and the [k(k—1)]/2 correlations 
are constant. In this case, Box's approximation gives the exact re-
sults; when the correlations are constant and the variances are 
equal, then the univariate F ratio can be used to give the exact 
significance level of the differences between the k correlated means. 
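
A minimal sketch of the epsilon computation, assuming the usual form of Box's fraction: k² times the squared difference between the average variance and the grand average of the covariance matrix, divided by (k-1) times the quantity (sum of squared elements, minus 2k times the sum of squared row averages, plus k² times the squared grand average). The function name `box_epsilon` and the example matrices are illustrative assumptions.

```python
def box_epsilon(cov):
    """Box's epsilon for a k by k covariance matrix given as nested lists."""
    k = len(cov)
    mean_diag = sum(cov[t][t] for t in range(k)) / k     # average variance
    row_means = [sum(cov[t]) / k for t in range(k)]      # average of each row
    grand = sum(row_means) / k                           # average of all k*k entries
    sum_sq = sum(cov[t][s] ** 2 for t in range(k) for s in range(k))

    numerator = k ** 2 * (mean_diag - grand) ** 2
    denominator = (k - 1) * (
        sum_sq - 2 * k * sum(m ** 2 for m in row_means) + k ** 2 * grand ** 2
    )
    return numerator / denominator


# Equal variances and equal covariances: epsilon reaches its maximum of one.
uniform = [[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 2.0]]

# Correlations that fall off with separation in time: epsilon drops below one.
decaying = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]]
```

For the `uniform` matrix the function returns one; for the `decaying` matrix it returns a value a little above .9, so the loss of degrees of freedom in that case would be slight.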

Geisser and Greenhouse (1958) have shown that the lowest value
that epsilon can take is 1/(k-1). They argue that since no one
has shown what sample estimate of epsilon is most appropriate, and
the robustness of epsilon has not been investigated, it is best to use
the minimum value of epsilon for a conservative test. This conservative
test consists of computing the univariate F, and entering the
tabulated F distribution with 1 and (N-1) d.f. If the result is significant,
there is no need to go further; the exact test would be significant.
However, if the conservative test is not significant, one can
now make an upper limit test of the univariate F (setting epsilon
equal to unity). If an assumed epsilon value of unity gives a nonsignificant
result, then the null hypothesis can be accepted, since no
calculated value of epsilon can give a more significant result. However,
if using full degrees of freedom gives a significant result, then
the research worker is in a dilemma. Geisser and Greenhouse apparently
would next try Box's approximate test, using a sample
estimate of epsilon. We would recommend an exact multivariate test
such as Rao's.
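
The three possible outcomes of this bracketing procedure can be sketched as a small decision function. The name `geisser_greenhouse_bracket` is an illustrative assumption, and the two critical values are taken to be read from an ordinary F table by the user.

```python
def geisser_greenhouse_bracket(f_ratio, crit_conservative, crit_full):
    """Bracket the significance of a univariate F.

    crit_conservative: tabled F for 1 and (N-1) d.f.
                       (epsilon at its minimum, 1/(k-1)).
    crit_full:         tabled F for (k-1) and (k-1)(N-1) d.f.
                       (epsilon equal to unity).
    """
    if f_ratio >= crit_conservative:
        # Significant even with the fewest possible degrees of freedom.
        return "significant"
    if f_ratio < crit_full:
        # Not significant even with full degrees of freedom.
        return "not significant"
    # Between the two limits: an exact multivariate test is needed.
    return "undecided"
```

For example, with N = 28 and k = 4 the .05 points are 4.21 for 1 and 27 d.f. and approximately 2.72 for 3 and 81 d.f. A hypothetical F of 3.50 would then fall in the undecided region, while any F above 4.21 is significant without further computation.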

One can see that the Geisser-Greenhouse approach allows one to
bracket the significance level of F with the same amount of computation
that is used in the usual two-way analysis of variance.
The laborious computations for Hotelling's exact multivariate Anova
include the data necessary for a two-way Anova. Therefore, it
will always be profitable to try the Geisser-Greenhouse approach
first, before proceeding to the rest of the distasteful arithmetic
necessary for multivariate analysis.

To illustrate, consider the worked example of multivariate Anova
given by Rao (1952, pp. 241-243). Four measures were obtained on
each of twenty-eight subjects. After computing a covariance matrix
and inverting it, Rao obtained a multivariate F ratio of 6.40 with
effective degrees of freedom, 3/25. This corresponds to a confidence 
level of about .005, thus indicating that there are some very sig- 
nificant differences between the four correlated means. 

If the usual two-way Anova is computed, a univariate F of 4.96 
is obtained. The Geisser-Greenhouse conservative test, using 1/27 
degrees of freedom, gives .04 as the lower limit of the confidence 
level. If the .05 confidence level were used, the laborious computation
of the covariance matrix and its inverse would not be necessary.
The Geisser-Greenhouse conservative test would have established
the significance of the differences between the correlated means at
the .05 level of significance.
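
The tail probabilities in this example can be checked numerically. The sketch below, in plain Python, evaluates the upper tail of the F distribution through the regularized incomplete beta function, using a standard continued-fraction method; the function names are illustrative assumptions and the routine is not taken from any of the texts cited.

```python
import math


def _beta_cf(a, b, x, max_iter=200, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    # Use whichever form of the continued fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_tail(f_ratio, df_num, df_den):
    """Probability that F with (df_num, df_den) d.f. exceeds f_ratio."""
    x = df_den / (df_den + df_num * f_ratio)
    return reg_inc_beta(df_den / 2.0, df_num / 2.0, x)


conservative = f_tail(4.96, 1, 27)   # Geisser-Greenhouse lower-limit test
full_df = f_tail(4.96, 3, 81)        # upper-limit test, epsilon = 1
multivariate = f_tail(6.40, 3, 25)   # Rao's exact multivariate F
```

Under these assumptions the conservative test for F = 4.96 gives a probability between .03 and .04, the full-degrees-of-freedom test gives a probability below .01, and the multivariate F of 6.40 gives a probability below .005.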

Exact multivariate Anova tests for two or more groups and two
or more treatments are considerably more complex than the Hotelling
test for equality of means. It is beyond the scope of this
paper to provide worked examples of complex multivariate Anova.
Interested readers should consult Geisser and Greenhouse (1958), 
as well as Danford, Hughes, and McNee (1960) for worked ex- 
amples. 


Treatment-Bound and Time-Bound Correlations 


Difficulties in computations arise because observations are not 
correlated equally. Random balanced assignment of treatments has 
been recommended (Box & Mueller, 1959) as a way of equating the 
correlation between treatment scores. Thus, for example, in testing 
k drugs, Subject A could be given drug 1 during week 1, drug 2 during
week 2, and so on for k weeks. Subject B could be given the k
drugs in the reverse order. Subject C would receive the k drugs in
still another order. The orders would be such that the correlations
between any two drug effects would involve every correlation between
the pairs of weeks. Thus, the correlation between any two
treatments would be approximately an average of the interweek
correlations. This randomization procedure will create a constant 
correlation between treatment effects, if the correlations between 
the scores are due solely to the temporal order of administration 
(time-bound correlations). For example, if all correlation is due to 
the auto-regression of trial (i+1) on trial i, then the correlation of
week 1 scores with week 2 scores will tend to be higher than with
week 3 or week 4 scores; balanced randomization of treatments over
weeks will create an over-all average correlation.
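
The averaging argument can be illustrated with a small calculation. The sketch below assumes an idealized design in which every one of the k! orders of administration is used equally often, and it treats the correlation between two treatments as the simple average of the interweek correlations for the weeks in which they fall. Both are simplifying assumptions, and the function name and the autoregressive example matrix are hypothetical.

```python
from itertools import permutations


def average_treatment_correlation(week_corr, a, b):
    """Average interweek correlation met by treatments a and b
    when every possible order of the k treatments is used once."""
    k = len(week_corr)
    total = 0.0
    count = 0
    for order in permutations(range(k)):
        # order[w] is the treatment given in week w.
        week_a = order.index(a)
        week_b = order.index(b)
        total += week_corr[week_a][week_b]
        count += 1
    return total / count


# Purely time-bound correlations: .8 raised to the number of weeks apart.
ar1 = [[0.8 ** abs(i - j) for j in range(4)] for i in range(4)]

pairs = [(a, b) for a in range(4) for b in range(a + 1, 4)]
averages = [average_treatment_correlation(ar1, a, b) for a, b in pairs]
```

Every pair of treatments receives the same average, the mean of the six distinct interweek correlations, which is the constant correlation the randomization is meant to produce.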




But what if the treatments affect the correlations directly
(treatment-bound correlations)? Suppose the “treatments” consist of
giving various arithmetic and vocabulary tests. Regardless of the order
in which the tests are given, the correlations between the arithmetic
tests and the correlations between the vocabulary tests will tend to
be greater than the correlations between arithmetic and vocabulary.
Randomization will have little or no ameliorating effect. 

Another way of trying to equate the correlation between treat- 
ment scores is to use matched subjects. This is analogous to the 
"split-plot" design in agricultural experimentation. For example, 
one could test all N subjects with a pretest, rank order the subjects 
on the pretest scores, assign the top k subjects at random to the k 
treatments, assign the next k subjects randomly to the k treatments, 
etc. Since the assignment is completely random, it would be expected 
that the correlations between pairs of treatments would be equal. 
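
The matching procedure just described can be sketched as follows. The function name `matched_assignment`, the dictionary of pretest scores, and the use of a seed are illustrative assumptions.

```python
import random


def matched_assignment(pretest, k, seed=None):
    """Assign subjects to k treatments within blocks matched on a pretest.

    pretest: dict mapping subject label to pretest score.
    Returns a dict mapping subject label to a treatment index 0..k-1.
    """
    if len(pretest) % k != 0:
        raise ValueError("number of subjects must be a multiple of k")
    rng = random.Random(seed)
    # Rank order the subjects from highest to lowest pretest score.
    ranked = sorted(pretest, key=pretest.get, reverse=True)
    assignment = {}
    for start in range(0, len(ranked), k):
        block = ranked[start:start + k]      # the next k subjects
        treatments = list(range(k))
        rng.shuffle(treatments)              # random order within the block
        for subject, treatment in zip(block, treatments):
            assignment[subject] = treatment
    return assignment
```

Each block of k adjacent subjects contributes exactly one subject to every treatment, so the treatment groups are balanced on the pretest.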

Q. McNemar has pointed out (personal communication) that
matching works only if the correlations are unaffected by the treatments:
“. . . suppose that treatment B has a marked disrupting effect
on the performance of the subjects, with the performance of some
subjects being much depressed while others are little affected and
still others boosted. Suppose further that treatments A, C, and D
do not lead to such chaos. Now, wouldn't the correlations (for the
pairs of scores) for B with A, C, and D be of a different order than the
r's among A, C, and D?”

Thus, the matched-subjects design fails in exactly the same case 
where balanced randomization fails—where the correlations are 
treatment-bound. However, the matched-subjects design does have 
the great virtue that it avoids all carry-over effects. In that sense it
is identical to the split-plot design in agriculture where there is no 
test or treatment carry-over from one subplot of ground to another. 
This is a great advantage in psychological experimentation where 
the history of the organism is often the chief factor determining the 
response to treatment. 

If the matching is done on an appropriate variable, such as a 
pretest, the error variance can be reduced to somewhere near the 
Within S's variance of the repeated measures design. One major 
disadvantage of the matching technique is that kN subjects are used 
instead of only N. This is extraordinarily extravagant of subjects 
in a continuous treatment design such as our example, and throws
away a great deal of relevant data. 




Tests that Are Robust to Departures from Constant Correlations, 
Constant Variances, and Normality 


If the subjects are sampled without regard to their scores, then 
the vector of k treatment scores for a subject can be presumed to 
be experimentally independent of the k scores for any other subject. 
If there is a significant correlation between two such score vectors it 
must result from both subjects having a similar pattern of response 
to the k treatments. This implies that some of the score variance is 
systematic, not due to chance. 

Suppose we calculate the average correlation between all N score 
vectors and find that it is significantly positive. Then it follows that, 
for some appreciable number of subjects, the k treatments have had 
a significant similar effect on their scores. If the average correlation 
between subjects is zero, then either all the score variance is due to 
chance, or the treatments have significant but dissimilar effects for 
different subjects (i.e., there are significant subject-treatment inter- 
actions). 

Only one assumption has been made so far, that the score vectors 
are experimentally independent. The question of constant correla- 
tions between the observations does not arise and is irrelevant here. 

However, using the average correlation for all N score vectors 
raises difficulties. First, the calculation would be very laborious. 
Secondly, the sampling distribution of the average correlation will 
be difficult to compute without introducing the assumption of multi- 
variate normality. Both of these problems can be avoided by using 
Kendall's W, the concordance coefficient.²

To compute Kendall’s W, the scores are rank ordered from 1 to k 
for each subject. The ranks are then summed for each treatment. 
By comparing the k sums with the constant sums expected by 
chance, the average rank-order correlation between the N vectors 
can be easily computed. W is a slight algebraic variant of this aver- 
age Spearman rank-order correlation. 
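
A minimal sketch of this computation, assuming no tied scores within a subject and at least two subjects. The function name `kendalls_w` is an illustrative assumption; the relations used are the standard ones, W = 12S / [N²(k³ - k)], average rho = (NW - 1)/(N - 1), and Friedman's chi-square = N(k - 1)W with (k - 1) d.f.

```python
def kendalls_w(scores):
    """Kendall's W for N subjects (rows) by k treatments (columns).

    Returns (W, average Spearman rho between subjects, Friedman chi-square).
    Assumes no ties within a row and N of at least two.
    """
    n = len(scores)
    k = len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        # Rank the k scores of this subject from 1 (lowest) to k (highest).
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank

    expected = n * (k + 1) / 2.0          # rank sum expected by chance
    s = sum((r - expected) ** 2 for r in rank_sums)
    w = 12.0 * s / (n ** 2 * (k ** 3 - k))
    average_rho = (n * w - 1.0) / (n - 1.0)
    chi_square = n * (k - 1) * w
    return w, average_rho, chi_square
```

When all subjects order the treatments identically, W and the average rho are both one; when two subjects give exactly opposite orders, W is zero and the average rho is minus one.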

² Wallis (1939) and Friedman (1937), independently of Kendall and
Babington Smith (1939), devised statistics that are exact algebraic
transforms of Kendall's W.

However, it must be kept in mind that W is a test for the equality
of the average ranks. It is not a test of the equality of the k observed
score means. Inequality of the score means does not necessarily
imply inequality of the average ranks and vice versa. Examples can
be constructed where the k score means are exactly identical, but 
the k average ranks differ significantly. Also, an example can be 
constructed where the average ranks are identical, but the k means 
differ widely. (Generally, for such cases to exist, skewed distribu- 
tions which differ for each treatment are needed.) When we want to 
know whether one treatment differs consistently from another treat- 
ment, and the amount of difference is irrelevant, then the rank-order 
test is appropriate. 
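
Two small constructed examples make the distinction concrete. The data sets below are hypothetical, have only two treatments, and are far too small for any significance test; they show only that means and average ranks can disagree in both directions. The function names are illustrative assumptions, and ties within a subject are assumed absent.

```python
def treatment_means(scores):
    """Mean score of each treatment (column) over all subjects (rows)."""
    n = len(scores)
    k = len(scores[0])
    return [sum(row[j] for row in scores) / n for j in range(k)]


def average_ranks(scores):
    """Average within-subject rank of each treatment (1 = lowest score)."""
    n = len(scores)
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            totals[j] += rank
    return [t / n for t in totals]


# Identical means, different average ranks: the first treatment wins
# narrowly in four subjects and loses badly in one.
equal_means = [[6, 5], [6, 5], [6, 5], [6, 5], [1, 5]]

# Identical average ranks, different means: the first treatment wins by a
# wide margin in half the subjects and loses narrowly in the other half.
equal_ranks = [[15, 5], [15, 5], [4, 5], [4, 5]]
```

In the first data set both means are 5 but the average ranks are 1.8 and 1.2; in the second the average ranks are both 1.5 but the means are 9.5 and 5.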

In many cases an a priori rank-order can be specified for the k
means. If the k treatments consist of amounts of sleep loss, or 
amounts of practice, or amounts of X-ray dose, etc., then we expect 
the scores on the k treatments to be a monotonic function of these 
amounts. 

Whenever a set of correlated means has a predicted rank-order, 
each subject’s obtained rank-order can be correlated with the pre- 
dicted rank-order and the average of all N rank-order correlations 
can be tested for significance. Jonckheere (1954a, 1954b) has pre- 
sented a general set of tests of this sort, using the tau of Kendall. 

Instead of using the average tau as Jonckheere does, it is possible 
to use the average Spearman rank-order coefficient, rho. Lyerly 
(1952) has described the distribution of the average rho. Tables of 
exact significance levels for the average rho are available for: N=3
with k=2, 3, 4; N=4 with k=2, 3; N=5 with k=2; N=6 with
k=2. (Copies may be obtained from A. Lubin.) The normal curve
approximation is sufficient for larger values of N and k.
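
A sketch of the average rho and its normal curve approximation follows. It assumes no tied scores within a subject and uses the standard result that, under the null hypothesis, a single Spearman rho based on k ranks has mean zero and variance 1/(k - 1), so that the average of N independent rhos has variance 1/[N(k - 1)]. The function name `average_rho` is an illustrative assumption.

```python
import math


def average_rho(scores, predicted_ranks):
    """Average Spearman rho between each subject's ranking and a predicted one.

    scores: N rows of k scores; predicted_ranks: the a priori ranks 1..k
    of the k treatments, in column order.
    Returns (average rho, normal deviate z).
    """
    n = len(scores)
    k = len(predicted_ranks)
    rhos = []
    for row in scores:
        # Obtained ranks for this subject, 1 = lowest score.
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0] * k
        for rank, j in enumerate(order, start=1):
            ranks[j] = rank
        d2 = sum((ranks[j] - predicted_ranks[j]) ** 2 for j in range(k))
        rhos.append(1.0 - 6.0 * d2 / (k * (k ** 2 - 1)))
    mean_rho = sum(rhos) / n
    z = mean_rho * math.sqrt(n * (k - 1))
    return mean_rho, z
```

Because the alternative hypothesis specifies the direction, the deviate z would ordinarily be referred to one tail of the normal curve.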

A. L. Edwards (personal communication) has suggested that 
where the alternative hypothesis specifies the rank-order (as in 
Jonckheere’s rank-order test), parametric trend-fitting techniques 
may be more powerful than rank-order tests. If the linear regression
of the observed means on their a priori rank yields a significant
positive slope, this is an empirical demonstration that the null
hypothesis should be rejected.

If there are 4 or more means, one could fit an asymptotic regres- 
sion such as a Mitscherlich equation. If the Mitscherlich exponential 
parameter is significant, this is sufficient to reject the null hypothe- 
sis. If we assume the existence of a subject-by-treatment interac- 
tion, the equation must be fitted separately for each subject, since 
the parameters can differ for each subject. In this case, the average
slope or average exponential parameter can be tested for significance,
using the Between S's variance as the error term.
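
For the linear case, the test of the average slope can be sketched as below. The sketch assumes that the k columns are already arranged in the a priori rank-order (so the predictor is simply 1, 2, ..., k), that the subjects' slopes are not all identical, and that the between-subjects variance of the slopes serves as the error term, giving a t with N - 1 d.f. The function name `mean_slope_t` is an illustrative assumption.

```python
import math


def mean_slope_t(scores):
    """Fit a straight line to each subject's k scores and test the mean slope.

    Returns (mean slope, t, d.f.), with d.f. = N - 1.
    """
    n = len(scores)
    k = len(scores[0])
    x = list(range(1, k + 1))                 # a priori ranks 1..k
    x_mean = sum(x) / k
    sxx = sum((xi - x_mean) ** 2 for xi in x)

    slopes = []
    for row in scores:
        y_mean = sum(row) / k
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, row))
        slopes.append(sxy / sxx)              # least-squares slope, one subject

    mean_slope = sum(slopes) / n
    # Between-subjects variance of the individual slopes.
    variance = sum((b - mean_slope) ** 2 for b in slopes) / (n - 1)
    t = mean_slope / math.sqrt(variance / n)
    return mean_slope, t, n - 1
```

The same pattern would apply to an exponential parameter, with the per-subject least-squares slope replaced by a per-subject fit of the chosen curve.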


Conclusion 


This brief survey of the statistical tests appropriate to a repeated 
measures design without carry-over does not, of course, cover all 
relevant topics, but it does indicate that there are rational pro-
cedures for treating the data which differ considerably from those 
found in the standard statistical texts for psychologists. 

To summarize the statistical recommendations made here for the
test of differences between correlated means: the experimenter might
very well begin with a two-way Analysis of Variance (Anova), but
evaluate it with the Geisser-Greenhouse conservative F test. If the
observed F is significant by the usual Anova test, but not by the
Geisser-Greenhouse test, the experimenter might go on to the exact
multivariate Anova test given by Hotelling's T², or attempt curve-fitting.
If the assumption of multivariate normality seems doubtful,
Kendall's W might be used. If the rank-order of the means can be
specified in advance, the use of Jonckheere's average tau or Lyerly's
average rho will considerably increase the power of the significance
test. However, the experimenter should never make a routine application
of the usual two-way Anova unless there are compelling reasons
to believe that the correlations between the observations are
equal.


REFERENCES 


Anderson, T. W. Introduction to Multivariate Statistical Analysis. 
New York: John Wiley & Sons, 1958. 

Box, G. E. P. “Some Theorems on Quadratic Forms Applied in the
Study of Analysis of Variance Problems. II. Effects of Inequality
of Variance and Correlation Between Errors in the Two-Way
Classification.” Annals of Mathematical Statistics, XXV (1954),
484-498.

Box, G. E. P. and Mueller, M. E. “Randomization and Least Squares
Estimates.” Journal of the American Statistical Association, LIV
(1959).

Danford, M. B., Hughes, Harry M., and McNee, R. C. “On the
Analysis of Repeated-Measurements Experiments.” Biometrics,
XVI (1960), 547-565.

Edwards, A. L. Experimental Design in Psychological Research.
New York: Rinehart & Company, 1960.

Friedman, M. “The Use of Ranks to Avoid the Assumption of
Normality Implicit in the Analysis of Variance.” Journal of the
American Statistical Association, XXXII (1937), 675-701.




Geisser, S. and Greenhouse, S. W. “An Extension of Box's Results on
the Use of the F Distribution in Multivariate Analysis.” Annals
of Mathematical Statistics, XXIX (1958), 885-891.

Hotelling, H. “The Generalization of ‘Student's Ratio.’” Annals of
Mathematical Statistics, II (1931), 360-378.

Jonckheere, A. R. “A Distribution-Free K-Sample Test Against
Ordered Alternatives.” Biometrika, XLI (1954), 133-145. (a)

Jonckheere, A. R. “A Test of Significance for the Relations Between
M Rankings and K Ranked Categories.” British Journal of Statistical
Psychology, VII (1954), 93-100. (b)

Journal of Abnormal and Social Psychology. LIV-LIX (1957-1959).

Journal of Experimental Psychology. LIII-LVIII (1957-1959).

Journal of Physiological and Comparative Psychology. L-LII
(1957-1959).

Kendall, M. G. Rank Correlation Method. London: Griffin and
Company, Ltd., 1948.

Kendall, M. G. A Course in Multivariate Analysis. London: Griffin
and Company, Ltd., 1956.

Kendall, M. G. and Babington Smith, B. “The Problem of M Rankings.”
Annals of Mathematical Statistics, X (1939), 275-287.

Kogan, L. S. “Analysis of Variance–Repeated Measurements.”
Psychological Bulletin, XLV (1948), 131-143.

Lana, R. E. and Lubin, A. “The Use of Analysis of Variance Techniques
in Psychology.” Progress report to the National Institute
of Mental Health, United States Public Health Service #M-4113(A),
March, 1961.

Lindquist, E. F. Design and Analysis of Experiments in Psychology
and Education. New York: Houghton-Mifflin, 1953.

Lyerly, S. B. “The Average Spearman Rank Correlation Coefficient.”
Psychometrika, XVII (1952), 421-428.

McNemar, Q. Psychological Statistics. New York: John Wiley &
Sons, 1962.

Rao, C. R. Advanced Statistical Methods in Biometric Research.
New York: John Wiley & Sons, 1952.

Wallis, W. A. “The Correlation Ratio for Ranked Data.” Journal of
the American Statistical Association, XXXIV (1939), 533-538.

Winer, B. J. Statistical Principles in Experimental Design. New
York: McGraw-Hill Book Company, 1962.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
Vol. XXIII, No. 4, 1963


KNOWLEDGE AND INTERESTS CONCERNING SIXTEEN 
OCCUPATIONS AMONG ELEMENTARY AND 
SECONDARY SCHOOL STUDENTS 


RICHARD C. NELSON 
Ball State Teachers College 


Accurate occupational information is essential to effective occupational
choice, which has been explored to some extent in the literature
relating to vocational development, but occupational knowledge
and related areas of exploration have been insufficiently studied.

One can find material to answer the questions: What single job
choice is preferred by each individual in a group at a given point in
time? In what areas do young people concentrate their interests at
a given point in time?

There is, however, little material to answer such questions as:
What do children know about occupations and how does this knowledge
develop? How do various age groups compare in their occupational
knowledge? What types of occupations have appeal to youngsters
of various age groups? Assuming multi-potentiality as a construct,
what is the range of occupations given positive consideration
at particular points in time by individuals?


Purpose and Problem 


In the light of the above questions, this study sought to provide 
an objective description of some elements of the occupational 
knowledge and interests of youth, especially at ages prior to high
school.

The problem was as follows. Given a stimulus of a group of
occupations for identification and reaction, how do children of
different age levels and backgrounds compare in naming and describing
the occupations, and how do they compare in their reactions
to the same occupations?


The Sample 


The 595 students included in the study were selected within the
Baltimore County, Maryland, school system. Classroom groups from
grades three, five, seven, nine, and eleven were used. Half the students
were selected from a district which is semi-urban in character;
the other half were from a semi-rural district. Due to the local
practice of homogeneous grouping, it was necessary to select intelligence
levels within which to work; the average and slightly above
average levels were chosen. Sex and socio-economic subgroups, in
addition to the above-mentioned grade level, intelligence level, and
urban-rural subgroups, were consistently used as subgroups for
which data were analyzed. The Edwards (1927) classification was
used to set up socio-economic subgroups based on occupations of
fathers; available school records indicated the occupations. Available
school intelligence test scores were used in the formation of
intelligence subgroups. Table 1 shows the distribution of subgroups
of the sample.


TABLE 1
Distribution of Subgroups of the Population by Grade Level

                            Grade   Grade   Grade   Grade   Grade
Subgroups                     3       5       7       9      11     Total

All Students                  68     119     141     146     121     595
Boys                          32      65      78      64      52     291
Girls                         36      54      63      82      69     304

Socio-economic Levels*
  Professional (1)           [?]     [?]     [?]      29      20      94
  Managerial (2)              12      17      23      21      18      91
  Clerical (3)               [?]     [?]     [?]      32      21     119
  Skilled Labor (4)          [?]     [?]     [?]      21      35     120
  Semi-skilled (5)            13      24      20      21      12      90

Intelligence Quotients
  116 and above               23      35     [?]     [?]     [?]     [?]
  108-115                     15      19     [?]     [?]     [?]     144
  100-107                     18     [?]     [?]     [?]     [?]     140
  99 and below                12     [?]     [?]     [?]     [?]     [?]

Urban                         34      63      65      76      59     297
Rural                         34      56      76      70      62     298

* Based on fathers' occupations. Does not include 81 students whose fathers work as
farmers, farm laborers, unskilled workers, or service workers. These 81 students were not
utilized in any cases where socio-economic subgroup comparisons were included. They were
included in all other comparisons.


The Instrument 


In order to widen the scope beyond questions concerning what a
child would like to be when he grows up, an instrument was especi- 
ally constructed for the study. It included colored slides of workers 
in 16 occupations (the number that could be responded to in a typi- 
cal class period), and a questionnaire which inquired into both 
knowledge and interests regarding those same occupations. 

Three principles guided the selection of the occupations for which 
slides were made. First, the occupations had to be available within 
the geographic area of the study since third grade children might be 
more limited than older children by geographic factors that would
limit potential understanding. Second, the occupations had to illus- 
trate the great breadth of our occupational structure. Merely se- 
lecting the occupations on a frequency basis was not sufficient; 
therefore the varying occupational levels should be illustrated and 
the jobs included should be distinct in skills required. Third, the 
selected occupations had to reflect the three to one ratio of men to 
women in the local occupational structure. 

Listed in the order maintained throughout the study, the 16 slides 
were: janitor, assembler (female), bookkeeper, carpenter, manager, 
teacher (female), farmer, engineer, laborer, sales clerk (female), 
truck driver, doctor, warehouseman, secretary (female), mechanic, 
and telephone lineman. 


Procedures 


Four major areas about each job were explored by each child. 
These were the title of the job, a description of the job, a reaction, 
whether “yes,” “no,” or “not sure,” concerning the prospect of en- 
tering the job when the child was “through school,” and why he 
responded favorably, unfavorably, or neutrally.

Children from grade three responded to the slides in an interview. 
Those from grades five, seven, nine, and eleven were surveyed in 
groups.

After preliminary questions were asked, the lights were turned 
off, a slide was shown, the lights were turned on, and the child was 
asked to respond to all four question areas discussed above. The 




interviewer recorded responses with third grade children. Above that 
grade the children recorded their own thoughts. Students who omit- 
ted or responded inadequately to as many as one-eighth of the ques- 
tions asked were interviewed briefly. They were asked enough 
questions to clarify their responses or to ascertain that a lack of 
information was the cause of the omission. 

A subgroup sample of ten per cent (59 students) was restudied 
90 to 120 days after their original experience to check reliability. 
The individual interview procedure was used in the retest. 


Analysis of Data 


The titles and descriptions given to the jobs by students were 
assigned a value of from three to zero. Three indicated an exact 
title or insightful description; two indicated a possible title or an 
adequate description; one an omission, unclear response, less possible 
title, or a combination of right and wrong factors in the descrip- 
tion; zero an incorrect response, one not considered possible, 
or such immature responses as “It’s a man.” Three judges 
achieved consensus on all values assigned. Internal consistency was
judged to be the most crucial objective. The Dictionary of Occupa- 
tional Titles (Federal Security Agency, 1949) was used as a major 
resource in assigning values. 

For purposes of coding on Remington Rand data processing cards, 
values were assigned the reactions or expressions of interest in the 
job. “Yes” was assigned the number zero, “not sure” was assigned 
the number one, and “no” was assigned the number two. Except for 
use in a reliability check, these numbers were not considered to be
scores.

Five neutral reasons, 19 positive reasons, and 21 negative reasons 
for reactions to occupations were classified and were assigned num- 
bers which were used only in the data processing to find out sub- 
group similarities and differences in reasons. 


Limitations 
There were many limitations in the study. As an exploratory,
descriptive study, however, it should be expected to raise many
questions, more perhaps than it answers. The population sample was
not randomly selected. The occupation sample, while broad, could
have been more inclusive, and other ways of selecting such a sample
could have been used. Face validity was assumed. There were procedural
differences necessary in the reliability substudy.

In essence, however, this study has opened a field of investigation 
which needs to be explored for vocational development to be under- 
stood. 

Findings 

Reliability. Reliability, after a 90 to 120 day lapse between testings,
was significant statistically at the one per cent level of confidence
for both knowledge and interest. The reliability coefficient of
knowledge scores, which reflected titling and describing consistency,
was .74. The reliability coefficient of reactions, reflecting consistency
of interests in the occupations, was .58. This is not a high correlation,
but when one considers that only 16 jobs were included in the study,
it may be hypothesized that a more extensive study might yield
higher reliability.

Knowledge. Both chi-square analysis and double entry analysis
of variance were utilized to ascertain subgroup and mean differences
and as evidence to accept or reject five hypotheses relating to
knowledge concerning occupations. A small part of the data is shown
in Tables 2 and 3 which display, respectively, chi-square results for
titling and describing of occupations by subgroups.


TABLE 2
Chi-Square Results for Titling of Occupations by Subgroups

                                              Socio-      Intelli-
                       Grade                 Economic      gence       Urban-
                       Level        Sex       Level        Level       Rural
Occupation              χ²          χ²          χ²           χ²          χ²

Janitor               63.07**      1.19        6.72        10.58        6.11
Assembler            115.43**     13.69*      16.98        17.13*       7.81
Bookkeeper           241.57**      2.38       15.35         9.21       11.94
Carpenter             66.64**      8.93*      14.48        22.02**     10.70*
Manager               71.40**      1.00       15.97        11.44       20.81
Teacher                7.74        2.98        9.57        14.67       10.24
Farmer                47.60**      2.08       20.81        23.10**     11.71**
Engineer             132.09**     11.31*      21.12*       28.87**     23.36**
Laborer               73.19**      3.57       20.84        44.63**     29.78**
Sales Clerk          128.52**     17.26**     22.86*       11.75       28.92**
Truck Driver         138.04**      3.57       13.48        18.27*       7.69
Doctor                 4.76         .00       10.07         5.87        9.01
Warehouseman         127.33**       .00       12.89        22.94*      17.00
Secretary             91.04*       2.98       16.51        14.37       14.75
Mechanic             245.14**      4.76       10.42        15.51        7.66
Telephone Lineman     74.97**     18.45**     26.11*       19.48*       5.86

Degrees of Freedom      12           3          12            9           3

** Significant at one per cent level.
* Significant at five per cent level.
Table 2 is read as follows. For the job of janitor, grade levels
were significantly different at the one per cent level of confidence
in their titling information. It can be further noted in the table
that there was a hierarchy of differences among subgroups; the jobs
of mechanic and bookkeeper discriminated most among grade level
subgroups in their titling accuracy for these jobs. Add the one other
piece of information not shown, that in all grade level comparisons
the third and fifth grade children scored low and the ninth and
eleventh grade children scored high, and the picture is complete.
Thus, older children exceeded younger children significantly in
titling accuracy for 14 of the 16 jobs. It was also true that older
children were consistently superior to younger in job describing
accuracy (see Table 3).
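
The chi-square values and degrees of freedom reported in Tables 2 and 3 are of the kind obtained from a subgroup-by-score contingency table. A minimal sketch of that computation follows; it assumes the usual Pearson chi-square for a table of counts in which no row or column total is zero, and the function name `chi_square` and any counts supplied to it are illustrative, not data from the study.

```python
def chi_square(table):
    """Pearson chi-square and degrees of freedom for a table of counts.

    table: list of rows (e.g. one per grade level), each a list of counts
    (e.g. one per score value from zero to three).
    """
    rows = len(table)
    cols = len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(rows)) for j in range(cols)]
    total = sum(row_totals)

    statistic = 0.0
    for i in range(rows):
        for j in range(cols):
            # Count expected if subgroup and score were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected

    degrees_of_freedom = (rows - 1) * (cols - 1)
    return statistic, degrees_of_freedom
```

With five grade levels and four possible score values the degrees of freedom are (5 - 1)(4 - 1) = 12, and with two sexes they are (2 - 1)(4 - 1) = 3.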


TABLE 3
Chi-Square Results for Describing Occupations by Subgroups

                                              Socio-      Intelli-
                       Grade                 Economic      gence       Urban-
                       Level        Sex       Level        Level       Rural
Occupation              χ²          χ²          χ²           χ²          χ²

Janitor                [?]         [?]        28.05**      13.93       11.90**
Assembler            118.41**     20.23*      15.01        19.96*      13.30**
Bookkeeper           199.92**      2.08       20.95        17.94*       2.26
Carpenter             87.47**      2.38       28.31**      11.30       29.58**
Manager               55.34**      1.19       21.43*        4.79        9.92*
Teacher               16.66         .00        9.47         4.60         .95
Farmer                14.88        1.19       10.46        10.11        3.34
Engineer             120.79**      3.57       14.36        14.40         .23
Laborer               41.05*       8.93*      12.59         8.65        2.69
Sales Clerk           20.83        3.57        8.12         7.89        3.50
Truck Driver          32.13**       .00       24.05*       19.40*       3.78
Doctor                20.83         .60       14.50         5.20       18.91**
Warehouseman          38.08**      4.17       11.40        18.45*       6.71
Secretary            111.86*      14.88*      15.77         6.26        3.21
Mechanic              23.80*       1.19       15.74         3.56        1.57
Telephone Lineman     24.99*       7.23       25.91**      16.27        7.18

Degrees of Freedom      12           3          12            9           3

** Significant at one per cent level.
* Significant at five per cent level.


Neither boys nor girls were consistently superior in titling or 
describing. Boys titled assembler, carpenter, and telephone lineman 
and described the assembler significantly more successfully than 
girls. Girls titled the sales clerk and engineer and described the 
secretary significantly more successfully than did boys. 

The upper socio-economic groups scored significantly higher than 
the lower socio-economic groups in titling and describing for all 
jobs on which significant differences appeared. The one exception 
was that of manager. The occurrence of higher scores for higher 
socio-economic groups, though not statistically significant for each 
job, was consistent enough to result in significant mean differences 
in the double entry analysis of variance when all job titling or 
describing scores were totalled. 

The upper intelligence level groups scored significantly higher
than the lower intelligence level groups in all cases in which
significant intelligence subgroup titling and describing differences
were found.

Urban children scored significantly higher than rural in all but
two of the cases in which significant urban-rural subgroup titling
and describing differences were found. One exception resulted from
the fact that rural children more accurately described the job of
manager. For describing the job of doctor the rural children more
often received the modal score of two, while the urban children
more often scored at the extremes.

Considering all jobs, including the results of both the chi square
and analysis of variance, and combining job titles and job
descriptions as indicative of occupational knowledge, the following
hypotheses were accepted or rejected as indicated.

Accepted: Differences in amount and accuracy of occupational
knowledge possessed by boys, as compared to girls, are due to chance
(equal to zero).

Rejected: Differences in amount and accuracy of occupational
knowledge possessed (1) by children of different grade levels, (2) by
students of various socio-economic levels, (3) by students of various
intelligence levels, and (4) by urban and by rural children are due
to chance (equal to zero).

The order of the four significant variables from high to low
appeared to be grade level, intelligence level, socio-economic level,
then urban-rural background.




Jobs most accurately titled were, in order, doctor, teacher, secre- 
tary, farmer, truck driver, carpenter, mechanic, sales clerk, janitor, 
engineer, telephone lineman, laborer, warehouse worker, bookkeeper, 
manager, and assembler. 

Jobs most accurately described were, in order, farmer, secretary,
carpenter, janitor, truck driver, teacher, telephone lineman,
mechanic, sales clerk, doctor, laborer, warehouse worker, engineer,
bookkeeper, manager, and assembler.

Reactions. Both chi-square analysis and analysis of variance were 
utilized to ascertain subgroup and mean differences and to gather 
evidence by which to accept or reject five hypotheses relating to 
reactions toward occupations. Table 4 shows chi-square results for 
reactions to occupations by subgroups. 

Table 4 is read much as Tables 2 and 3. In their reactions to the 
job of janitor, for example, grade level groups were significantly 
different at the one per cent level of confidence. Again there can be 
observed a hierarchy in which the greatest differences among grade 


TABLE 4

Chi-Square Results for Reactions to Occupations by Subgroups

                                         Socio-     Intelli-
                     Grade               Economic   gence      Urban
                     Level     Sex       Level      Level      Rural
Occupation           χ²        χ²        χ²         χ²         χ²

Janitor              32.13**   2.74      10.49      12.92      6.06*
Assembler            39.87**   25.49**   7.82       1.82       2.77
Bookkeeper           23.80**   .87       1.86       2.00       8.80*
Carpenter            35.11**   140.20**  12.77      11.70      9.02*
Manager              20.23**   14.70**   5.85       7.17       11.59**
Teacher              50.58**   63.02**   25.17**    3.95       1.46
Farmer               24.40**   42.74**   22.61      3.75       5.83
Engineer             7.14      41.83**   5.03       6.11       5.99*
Laborer              22.02**   98.66**   15.69*     16.44*     9.12*
Sales Clerk          29.10**   118.65**  26.44**    3.11       13.09**
Truck Driver         14.88     98.36**   25.38**    16.20*     7.21*
Doctor               10.12     .42       14.15      3.21       3.97
Warehouseman         21.42*    [?]9.72** 16.83*     17.50*     11.78**
Secretary            24.40**   229.43**  32.31**    12.10      8.42*
Mechanic             11.31     223.56**  14.76      13.33      7.66
Telephone Lineman    15.47     86.99**   5.63       4.33       .33

Degrees of Freedom   8         2         8          6          2

** Significant at one per cent level.
* Significant at five per cent level.




levels were seen for the jobs of teacher, assembler, janitor, and on 
through the chi-square differences from high to low. 

One factor, the direction of the differences, remains to be dis- 
cussed since it is not shown in the table. 

In all cases where there were significant differences in grade level 
subgroups in reactions to jobs, the third and fifth grades proved to 
respond positively more often, while the ninth and eleventh grade 
children responded positively less often. 

The direction of sex differences may be summed up in the fact 
that boys exceeded girls in numbers of positive reactions for all 
jobs in which significant differences appeared except those of as- 
sembler, manager, teacher, sales clerk, and secretary. 

The socio-economic subgroups were consistent in that those from 
the highest levels responded positively less often, while those from 
the lowest levels responded positively more often. 

Intelligence subgroups reflected a similar pattern. High groups
were less positive; low groups were more positive toward
occupations. This pattern was so consistent, although statistically
significant for only three jobs, that when double entry analysis of
variance was computed for all jobs, mean differences were significant.

In 10 of the 11 cases of significant differences between urban and
rural reactions, the rural group exceeded the urban group in number
of positive responses. Only for the job of engineer did the urban
group exceed the rural in number of positive responses.

Considering all jobs and including the results of both the
chi-square analysis and analysis of variance together, the following
hypotheses were all rejected as indicated.

Rejected: Differences in the occupational interests held (1) by
students of various grade levels, (2) by boys and by girls, (3) by
students of various socio-economic levels, (4) by students of various
intelligence levels, and (5) by urban and rural children are due to
chance (equal to zero).

Numbers of positive reactions or “yes” responses to the idea of
holding a job when the child was “through school” were further
inspected. The order in which all possible subgroups ranked the
occupations was ascertained and rank order correlations were
derived. Table 5 reports the correlations for all subgroups.
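
A rank order correlation of this kind can be sketched as follows. The study does not say which computing formula was used, so the sketch assumes the familiar untied-rank form of Spearman's coefficient, rho = 1 - 6*sum(d^2) / (n(n^2 - 1)). The function name and the two short rankings are hypothetical illustrations, not data from the study.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank order correlation for two rankings without ties.

    Assumed formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
    where d is the difference between the two ranks given to each item.
    """
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("rankings must have the same length (at least 2)")
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


# Hypothetical example: two subgroups each rank the same five occupations.
group_one = [1, 2, 3, 4, 5]
group_two = [2, 1, 3, 5, 4]
rho = spearman_rho(group_one, group_two)
```

In the study itself each pair of subgroups ranked all 16 occupations, so n would be 16 rather than 5.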

Table 5 shows that the correlation for boys of grade three and
grade five, based on their relative preference for the 16 occupations,
was .88. This table shows some extremely important findings. The
similarity in ranking the 16 occupations achieved statistical signifi- 
cance in all pairs of subgroups for which correlation was sought 
except at two points. Boys and girls in the sex comparison ranked
the jobs in a manner that resulted in a low negative correlation.
The correlation of .42 in the socio-economic comparisons was not 
significant. The true import of the table, though, lies in the fact that 
subgroups tended to rank the occupations very similarly. 

One hypothesis had been constructed which dealt with this area 
of study; on the basis of the information in Table 5 it was reacted 
to as follows. 


TABLE 5
Rank Order Correlations of Positive Reactions for All Subgroups

                    Grade 5   Grade 7   Grade 9   Grade 11
Subgroups           Boys      Boys      Boys      Boys
Grade 3 Boys        .88       .79       .70       .79
Grade 5 Boys                  .84       .58       .76
Grade 7 Boys                            .74       .86
Grade 9 Boys                                      .82

                    Grade 5   Grade 7   Grade 9   Grade 11
                    Girls     Girls     Girls     Girls
Grade 3 Girls       .82       .79       .84       .79
Grade 5 Girls                 .95       .96       .94
Grade 7 Girls                           .95       .94
Grade 9 Girls                                     .96

Socio-economic
Level               Level 2   Level 3   Level 4   Level 5
Professional 1      .87       .78       .42*      .07
Managerial 2                  .94       .52       .82
Clerical 3                              .57       .80
Skilled Labor 4                                   .72
Semi-skilled 5

Intelligence
Quotients           108-115   100-107   99 and below
116 and above       .93       [illegible]   [illegible]
108-115                       .84       .70
100-107                                 [illegible]
99 and below

Sex                 Girls
Boys                [illegible]

Urban-Rural         Rural
Urban               [illegible]

* All correlations significant at the five per cent level of confidence except those so marked.




Rejected: Correlations among pairs of grade levels in their rank 
order preferences for occupations are not statistically significant. 

Reasons. The fourth question was concerned with why occupations 
were reacted to positively or negatively. The five positive reasons 
which were implied or most frequently mentioned by students in 
explaining why they would like a job, and the number of mentions
follow: some positive inherent aspect, or “I like that kind of job,” 
786; interesting, fun, 170; satisfied with money, 155; satisfied with 
surroundings, 66; altruism, “can help other people,” 63. The ten 
negative reasons which were implied or mentioned most frequently, 
and the number of mentions follow: negative inherent aspect, 821; 
hard work (either mental or physical, often unspecified), 524; not 
interesting, 424; unsatisfied with money, 367; dislike surroundings, 
190; physical danger, 168; prefer other job, 138; waste of learning, 
135; would not like it, 120; sex inappropriate (boy's or girl's job), 
120. 

Frequency of mention of status and pay as reasons for reacting 
to jobs increased significantly with grade level. Frequency of men- 
tion of fear of danger or fear of error decreased significantly as 
grade level increased. Socio-economic level and intelligence level 
showed a positive but not significant relationship to frequency of 
mention of status and pay. 

For the most part, however, reasons which were given for
reactions to particular jobs at one grade level, socio-economic level,
or intelligence level, were similar to the reasons given by other
subgroups.


Conclusions 


1. The instrument used was sufficiently reliable to justify its ex- 
pansion and use in further study. 

2. The techniques used were successful in differentiating amount
of occupational knowledge possessed by subgroups. Titling
discriminated more than did describing.

3. The techniques used were successful in differentiating interest
in the 16 occupations by subgroups.

4. A [illegible] or V. Q. could be established which
would point out occupational knowledge needs at varying levels.

5. The children showing more knowledge about occupations
tended to be from the higher socio-economic levels, higher
intelligence levels, higher grade levels, and urban background. Sex
was not a significant factor.

6. Subgroup preferences for occupations varied, but the major
difference was that some subgroups were more positive toward the
occupations in general, while ranking them similarly.

7. Children showing greater inclination to be positive toward the
16 occupations tended to be from lower socio-economic levels, lower
intelligence levels, lower grade levels, and rural background. Sex
differences were inconclusive.

8. Sex was the most important factor in determining reactions 
toward the individual occupations. 

9. Rank order preferences for all occupations by sex were in- 
versely, but not significantly, related. 

10. Rank order correlations indicating preferences for occupa- 
tions were positive and significant for all subgroup pairs based on 
intelligence level, grade level, urban-rural background, and for all 
but one of the ten socio-economic level pairs. 

11. Grade to grade ranking of jobs was sufficiently similar to 
justify questioning the assumption that children in the third and 
fifth grade are in a fantasy stage in vocational development. 

12. There are some sex and grade level differences in reasons 
given for reacting positively or negatively toward an occupation, 
but the reasons for which children reacted remained relatively simi- 
lar from grade to grade. It seemed reasonable to draw the implica- 
tion that these children reacted toward these occupations in much 
the same way that adults might be expected to react. 

13. Certain reasons tended to be related to particular occupations.
For example, altruism related to doctor and teacher; low pay related
to janitor. This affinity seemed to help establish the idea that
children were responding in ways which might be considered rather
mature since most of the connections were ones that logic would
dictate.

14. Substantial indication was given that occupational study 
might profitably be expanded in the elementary school curriculum. 
This would have the advantage of creating concepts prior to a
thorough internalization of sex typing and socio-economic typing of 
occupations by individuals. 

15. When occupations are studied late in the child's school career, 
much time may be spent by individuals in reconstructing attitudes 
so that some occupations are acceptable. 




16. Because the narrowing process is evident as early as the third 
grade, vocational adjustment problems may be created when form- 
erly rejected occupations must be reconsidered. 

17. It may be hypothesized that an occupation which suddenly 
appeals to the child rarely rises from among formerly rejected occu- 
pations or consistently rejected levels of work. 

18. The groundwork must be laid early if the child is to aspire 
differently from his peer or parent levels of aspiration. There is 
probably some bias in the laboring population against the less active 
occupations, and in the professional level against the occupations 
requiring heavy work. Early reformation of attitudes may be nec- 
essary if distress to the individual is to be avoided when lines are 
crossed. 

19. Walsh (1956) believed that only persons significant to the 
individual are able to help him alter his self concept. Significant 
teachers in elementary school might be expected to have an effect 
by encouraging the child to consider a wide range of what might be 
called “possible-positives” before he comes close to the occupational 
choice point. 

20. Negatives are of great importance in occupational decision 
making; negative responses outnumbered positive responses nearly 
three-and-a-half to one for all of the children in the study. Besides 
limiting the occupations from which choices may be made, it is likely 
that they form points of reference to which newly-encountered 
occupations are compared. 

21. “Possible-positives” among occupations also play a referent 
role. The chances of choices being made from this range and similar 
occupations would appear to be far greater than from a first and 
second choice elicited at one point in time. Strong (1931, p. 6) 
pointed out that, “At any moment, of only one, or at most of a very
few, of all these interests can he be conscious.” 

22. Sex-appropriateness, an activity of interest, and status appear 
to be the three major referent points by which children evaluate 
occupations. 

23. There would seem to be some basis to hypothesize that the
occupational choice question may be an encouragement to fantasy.
Asking an elementary school child what he wants to be when he
grows up is on a par with asking what college he is going to or whom
he plans to marry. All of these questions presuppose more
information than most young people have at hand. It would appear to be
more important to investigate whether the child is involved in the
process of narrowing the field and whether the process of narrowing
seems appropriate in relation to the child's intelligence,
opportunities, and background.

24. Vocational realism may be better sought, at least through
junior high school, by asking a child what occupations he rejects
for himself, rather than asking about his choices.


Summary 


Much remains to be investigated and written concerning voca- 
tional development. This study has attempted to contribute by sug- 
gesting and, to some extent, testing a method by which knowledge 
and interests concerning occupations can be studied. The literature
has not previously come to grips with some of the most elementary 
factors in understanding vocational development. 

This study should be utilized as a springboard for research lead-
ing to further understanding of vocational development as more 
than occupational choice. Often the two concepts have been utilized 
as if they were synonymous. 

This study suggests that the occupational elimination process 
starts early, that occupational attitudes do not await the ninth grade 
unit on occupations, that fantasy in occupational thinking of 
younger children comes partly from the questions asked of them, 
and that relatively irreversible and damaging occupational concepts 
may be internalized because little effort is made to help children 
develop an early and objective understanding of the world of work. 


REFERENCES 


Edwards, Alba M. Alphabetical Index of Occupations by Industries
and Socio-Economic Groups. United States Department of
Commerce, Bureau of the Census. Washington: Government Printing
Office, 1927.

Federal Security Agency. Dictionary of Occupational Titles, Volume
I, Definitions of Titles. Washington: Government Printing
Office, 1949.

Strong, Edward K., Jr. Change of Interest with Age. Stanford:
Stanford University Press, 1931.

Walsh, Ann Marie. Self-Concepts of Bright Boys with Learning
Difficulties. New York: Teachers College, Columbia University,
1956.

EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 4, 1963


THE FORM EQUIVALENCE BETWEEN THE WECHSLER- 
BELLEVUE INTELLIGENCE SCALE, FORM I AND THE 
WECHSLER ADULT INTELLIGENCE SCALE¹


CHARLES NEURINGER 
University of North Dakota 


Since the Wechsler Adult Intelligence Scale (WAIS) has
extensively replaced the Wechsler-Bellevue Intelligence Scale, Form I
(WB), as the most important test of intelligence available, it has
become imperative to ascertain the degree of form equivalence
between the two tests. After the appearance of the WAIS in 1955
(Wechsler, 1955) there has been some attempt to study the form
equivalence of the WAIS with its predecessor the WB (Wechsler,
1944). Cole and Weleba (1956) had students in their Wechsler
training course administer both tests to 46 undergraduates. The
order of presentation was not counterbalanced since 13 students
received the WAIS first while 33 students received the WB as the
initial test. They reported correlations of .87, .12, and .52 for the
Verbal, Performance, and Full Scale IQ's respectively. The limited
range of talent used and the inadequately balanced order of
presentation attenuated the meaningfulness of their findings.

Goolishion and Ramsay (1956) reviewed the WB's of 392 white
psychiatric patients (190 males and 202 females) and the WAIS's
of 154 patients (91 males and 63 females). The Object Assembly
Subtest was not considered in their analysis. They found significant
differences between the mean scores on the Arithmetic, Digit Span,
Digit Symbol, Picture Completion, and Block Design Subtests and
the Performance and Full Scale IQ's. From this, the authors
concluded that the two forms were not equivalent. However, this
conclusion should be treated with extreme caution since, besides the
subjects being equated only on age and sex, the experimental design
called for different subjects for the test administrations, thus
raising the possibility of a sampling bias.

¹ This report is based on a larger study submitted to the Department of
Psychology and the faculty of the Graduate School
in partial fulfillment of the requirements for the degree of Master of Arts.

Guertin, Rabin, Frank, and Ladd (1962) in their recent review
of the Wechsler literature pointed out that form equivalence
between the two tests depends upon both correlation and differences in
scores. They also cited the present study in its unpublished form
(1956) as fulfilling the requisites of appropriate range of talent,
sufficient N, unbiased sampling, and appropriate counterbalancing,
necessary for evaluation of the form equivalence of the two tests. It
is therefore felt that the data derived from the present study
present a clearer reflection of the state of the form equivalence
between the WAIS and WB.


Method 


Fifty-one subjects were randomly selected from introductory
courses at the University of Kansas during the Spring and Summer
sessions of 1955. The mean age and standard deviation of the 25
males were 19.8 and 2.68 respectively. A mean age of 18.7 with a
standard deviation of 1.78 was found for the 26 females used in
the study. The difference between the ages was found not to be
statistically significant.

The two complete tests were administered to each subject. For 
the WB, the inclusion of the Vocabulary Subtest necessitated pro- 
rating the Verbal IQ scale scores in order to arrive at the Verbal 
IQ's and Full Scale IQ’s. Twenty-four of the subjects received the 
WAIS first while 26 received the WB initially. The sexes were evenly 
distributed in the counterbalancing. One female had only the Verbal 
IQ Scale administered to her because of a severe motor disability. 

In order to overcome the restricted range of talent in our sample
the IQ scale correlations were analyzed by a method suggested by
McNemar (1949), wherein the effect of a greater population
variance could be evaluated on the spread of scores. As the best estimate
of population variance of IQ in the general population, Wechsler's
(1955) report of a standard deviation of 15 IQ points for his
sampling population was used.
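
The correction described above can be sketched as follows. The article says only that a method suggested by McNemar (1949) was used, so this sketch assumes the standard univariate correction for restriction of range, r_c = rk / sqrt(1 - r^2 + r^2 k^2), where k is the ratio of the population standard deviation to the sample standard deviation. The function name is hypothetical, and pairing the uncorrected coefficients with the WAIS standard deviations of Table 2 is likewise an assumption.

```python
import math


def correct_for_range_restriction(r, sd_sample, sd_population):
    """Estimate the correlation expected in an unrestricted population.

    Assumed formula (standard univariate correction):
        r_c = r * k / sqrt(1 - r^2 + r^2 * k^2),  k = sd_population / sd_sample
    """
    if sd_sample <= 0 or sd_population <= 0:
        raise ValueError("standard deviations must be positive")
    k = sd_population / sd_sample
    return r * k / math.sqrt(1 - r * r + r * r * k * k)


POPULATION_SD = 15.0  # Wechsler's (1955) standard deviation of IQ

# Uncorrected coefficients from the text, WAIS standard deviations from Table 2.
verbal = correct_for_range_restriction(0.77, 9.1, POPULATION_SD)
performance = correct_for_range_restriction(0.34, 12.2, POPULATION_SD)
full_scale = correct_for_range_restriction(0.64, 10.3, POPULATION_SD)
```

Under these assumptions the Verbal and Full Scale values round to the reported .89 and .77. The Performance value comes out near .41 rather than the reported .44, so the inputs actually used for that scale may have differed from those assumed here.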

The weighted subtest scores were used throughout as the basis of 
the computations. 




Results 


Differences between the Performance and Full Scale IQ scores for 
the WAIS and WB were found. No statistically significant differ- 
ences were discovered for the Verbal IQ Scale and the individual 
subtests. Correlation coefficients for the subtests and IQ Scales were
all statistically significant except for the Object Assembly
Subtest.


Differences Between Subtests and IQ Scales 


The means, standard deviations, and t-tests of the difference for 
the individual subtests are found in Table 1. The means were 
greater, and the standard deviations were lower, than those found 
in the normal population. The highest means found for the WAIS
and WB were the Comprehension and Block Design Subtests
respectively. The Digit Span Subtest mean was the lowest on both
tests. A statistical evaluation of the differences between the cor- 
related means was carried out, and it was found that none of the 
t-tests values were statistically significant. 

The means and standard deviations of the IQ scales are presented 
in Table 2. The IQ scale data was analyzed by an analysis of vari- 
ance technique suggested by Grant (1949) in which counterbalanced 
order of presentation, order of presentation regardless of test
(practice effect), and subject differences could be evaluated as well 
as the mean differences between the two tests. The results of the 
three analyses of variance are found in Tables 3, 4, and 5. For the 


TABLE 1

Means, Standard Deviations, and t-tests of the Differences Between
the Individual Subtests for the WAIS and WB Intelligence Scales

                           WAIS             WB
Subtest                Mean    S.D.     Mean    S.D.     t
Information            13.00   2.0      12.62   1.7      .65
Comprehension          14.82   2.4      13.64   1.8      [illegible]
Arithmetic             12.60   2.4      12.50   3.6      [illegible]
Digit Span             12.15   3.1      10.62   2.8      [illegible]
Similarities           12.82   1.5      13.33   2.1      [illegible]
Vocabulary             12.52   2.2      12.05   [illegible]
Digit Symbol           12.48   2.4      13.22   [illegible]
Picture Arrangement    12.96   4.1      13.48   [illegible]
Picture Completion     [illegible]
Block Design           [illegible]




TABLE 2
Means and Standard Deviations of the IQ Scales for the WAIS and WB
Intelligence Scales

                   WAIS               WB
IQ Scale        Mean     S.D.     Mean     S.D.
Verbal          119.82   9.1      119.90   8.6
Performance     118.82   12.2     123.52   11.4
Full Scale      120.56   10.3     124.22   9.5


Verbal Scale IQ's (Table 3), the difference between the test means
was found not to be significant. The mean for the test administered
first was significantly different from the mean of the test
administered second, thus indicating the presence of a practice effect. The
F ratio for order of presentation was found to be significant.
However, the subjects within order also differed significantly, and
when the mean square for order was divided by the mean square
for subjects within order, the F value of 1.90 was found to be
nonsignificant. Non-significance here meant that the subjects differed
among themselves when order had no effect. For the Performance
Scale IQ's (Table 4), the WB mean was found to be significantly
greater than the WAIS mean. The F ratios representing practice
effect and the subjects within order were also significant. However,
there was no significant difference between orders of presentation.
For the Full Scale IQ's (Table 5), the WB mean IQ score was
significantly greater than the WAIS IQ score. The F ratios for
practice effect and for subject differences were significant. There was no
significant difference for the order of presentation.
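
The F ratios discussed for the Verbal Scale can be checked with a few lines of arithmetic. This is only a sketch: it takes the mean squares as printed in Table 3 and divides each by an assumed denominator (the error mean square, or the mean square for subjects within order for the retest of order); the function and variable names are hypothetical.

```python
def f_ratio(ms_effect, ms_denominator):
    """Return an F ratio: effect mean square over the chosen denominator."""
    if ms_denominator <= 0:
        raise ValueError("denominator mean square must be positive")
    return ms_effect / ms_denominator


# Mean squares as printed in Table 3 (Verbal IQ Scale).
MS_PRACTICE = 122.98
MS_ORDER = 266.36
MS_SUBJECTS_WITHIN_ORDER = 139.74
MS_ERROR = 16.50

f_practice = f_ratio(MS_PRACTICE, MS_ERROR)
f_order_vs_error = f_ratio(MS_ORDER, MS_ERROR)
# Order of presentation retested against subjects within order.
f_order_vs_subjects = f_ratio(MS_ORDER, MS_SUBJECTS_WITHIN_ORDER)
```

With these inputs the first two ratios round to the tabled 7.45 and 16.14, and the third comes to about 1.9, in line with the nonsignificant value of 1.90 reported in the text.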

The practice effect was greater on the Performance and Full


TABLE 3
Analysis of Variance of the Verbal IQ Scale Scores for the WAIS and WB
Intelligence Scales

                                 Mean
Source                    df     Square     F          F
Tests                     1      .15        -
Practice effect           1      122.98     7.45**
Order of presentation     1      266.36     16.14**    1.90
Subjects within order     49     139.74     8.40
Error                     49     16.50
Total                     101

** Significant at the .01 level of confidence.


TABLE 4
Analysis of Variance of the Performance IQ Scale Scores for the WAIS and WB
Intelligence Scales

                                 Mean
Source                    df     Square       F
Tests                     1      676.00       17.87**
Practice effect           1      2,830.24     74.83**
Order of presentation     1      13.20        -
Subjects within order     48     195.67       5.17**
Error                     48     37.82
Total                     99

** Significant at the .01 level of confidence.


Scale IQ Scales than on the Verbal IQ Scale. This is not surprising
since performance tasks are more vulnerable to practice effects
than verbal tasks.


Correlations between Subtests and IQ Scales 


The correlations between the individual subtests on the WAIS
and WB are presented in Table 6. Spearman's Rank Order
Correlation method (rho) was used wherever the data did not meet the
assumptions of the Pearson Product-Moment Correlation technique
(r). The coefficients range from .81 for the Information Subtest to
.04 for the Object Assembly Subtest.² Except for the Object
Assembly all the subtest correlations were significant at the .01 level
of confidence.


TABLE 5
Analysis of Variance of the Full Scale IQ Scale Scores for the WAIS and WB
Intelligence Scales

                                 Mean
Source                    df     Square     F
Tests                     1      334.89     20.23**
Practice effect           1      954.79     57.69**
Order of presentation     1      37.80      2.28
Subjects within order     48     165.98     10.02**
Error                     48     16.55
Total                     99

** Significant at the .01 level of confidence.


² The low correlation on the Object Assembly seems to be an artifact of
differences between the scoring format of the two tests.
The WAIS scoring technique is much more discriminating than that of the
WB, allowing for a higher ceiling to be reached on it. A somewhat greater
spread of scores was found for the WAIS than for the WB Object Assembly.
Because many subjects achieved a low ceiling score on the WB Object
Assembly, while varying in their corresponding WAIS scores, the correlation
was severely attenuated.


TABLE 6
Correlation Coefficients between the WAIS and WB Individual Subtests

Subtest                Correlation    Coefficient
Information            r              .81**
Comprehension          r              .61**
Arithmetic             r              .58**
Digit Span             r              [illegible]
Similarities           r              [illegible]
Vocabulary             r              .66**
Digit Symbol           r              .51**
Picture Arrangement    rho            .44**
Picture Completion     rho            .65**
Block Design           r              [illegible]**
Object Assembly        rho            .04

** Significant at the .01 level of confidence.

The uncorrected correlations for the IQ scales of the two tests, and
the coefficients corrected for the effect of population variance on
spread of scores, can be found in Table 7. The uncorrected coefficients
range from .77 for the Verbal IQ Scale to .34 for the Performance
IQ Scale. The Verbal and Full Scale IQ correlations were found to
be significant at the .01 level of confidence while the Performance
Scale IQ's were significant at the .05 level of confidence. The
correlation of the Performance Scale IQ's minus the Object Assembly
subtest was also calculated and a coefficient of .43 was found which
was significant at the .01 level of confidence. When the Verbal,
Performance, and Full Scale IQ correlation coefficients were corrected
for the restricted range of talent, they rose to .89, .44, and .77,
respectively. The uncorrected coefficients account for only 59, 11,


TABLE 7

Pearson r's Between the WAIS and WB IQ Scales and Corrected Coefficients
for Uncurtailed Distribution in the Population

                       Uncorrected    Corrected
IQ Scale               Coefficient    Coefficient
Verbal Scale           .77**          .89
Performance Scale      .34*           .44
Full Scale             .64**          .77

* Significant at the .05 level of confidence.
** Significant at the .01 level of confidence.


TABLE 8
IQ Scale Regression Coefficients for Predicting Scores from One Test to the Other

IQ Scale        WB → WAIS    WAIS → WB
Verbal          .81          .82
Performance     .36          .31
Full Scale      .69          .58


and 41 per cent of the variance due to the correlation between the 
two forms. 

Regression coefficients were calculated in order to evaluate the 
ability of either test to predict scores on the other form. The regres- 
sion data can be found in Table 8. It would appear that the predic- 
tive power of the Verbal IQ Scale is greater than for the other two 
Scales, regardless of the direction of prediction. 
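
A regression coefficient of this kind can be sketched as follows, assuming the usual simple-regression slope b = r * (s_y / s_x), with the uncorrected correlations from the text and the standard deviations from Table 2. The article does not state its computing formula, so the function name and this choice of inputs are assumptions rather than the author's procedure.

```python
def regression_slope(r, sd_predicted, sd_predictor):
    """Slope for predicting one score from another: b = r * s_y / s_x."""
    if sd_predictor <= 0:
        raise ValueError("predictor standard deviation must be positive")
    return r * sd_predicted / sd_predictor


# (uncorrected r, WAIS S.D., WB S.D.) taken from the text and Table 2.
scales = {
    "Verbal": (0.77, 9.1, 8.6),
    "Performance": (0.34, 12.2, 11.4),
    "Full Scale": (0.64, 10.3, 9.5),
}

# Predicting WAIS scores from WB scores (the WB → WAIS column).
wb_to_wais = {
    name: regression_slope(r, sd_wais, sd_wb)
    for name, (r, sd_wais, sd_wb) in scales.items()
}
```

Under these assumptions the WB → WAIS slopes round to .81, .36, and .69, matching that column of Table 8. The same formula applied in the reverse direction gives values close to, but not always identical with, the printed WAIS → WB column.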

Discussion 

After evaluating the data gathered in this study, it was found 
that there were significant correlations between the subtests (except 
Object Assembly) and the IQ Scales as well as no statistically sig- 
nificant differences between the two forms, as far as the subtests and 
Verbal IQ Scale scores were concerned. On the other hand, there 
were statistically significant differences between the Performance 
and Full Scale IQ Scales on the two forms and even though the 
subtest and IQ scale correlations were significant, they were low. 

Although there were methodological and design differences
between this study and previous research, the findings appear to be
consonant with them. Goolishion and Ramsay (1956) reported
significant differences between the Performance and Full Scale IQ's
on the WAIS and WB for their independent psychiatric
populations. Both in their study and in the present one, the IQ means of
the WB were higher than those of the WAIS. Cole and Weleba's
(1956) IQ scale correlations, although lower, were in the same
direction and hierarchical order as those found in this study.

The low correlations, the significant differences between the
Performance and Full Scale IQ's, the low regression coefficients, as
well as the small amounts of variance attributable to the correlation
between the WAIS and WB seemed to indicate that the form
equivalence between the two tests is low.




The Verbal IQ Scale and its subtests seem to have the greatest 
amount of form equivalence when compared to the Performance and 
Full Scale IQ's and their accompanying subtests. If the rule of
thumb criteria for form equivalence, which holds that .75 is accept- 
able for group prediction and .85 for individual prediction, is fol- 
lowed, then the uncorrected Verbal IQ Scale correlations meet only 
the group criterion. The other two IQ scales meet neither criterion. 
However, when one considers that the Verbal IQ Scale correlation 
of .77 accounted for only 59 per cent of the variance and that 41 
per cent of the variance arises from undetermined sources, it is 
difficult to credit the Verbal scales as being form equivalent for even 
group prediction. Great caution, if not forbearance, should be exer- 
cised in extrapolating data from one form to the other. 
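The arithmetic behind the 59 and 41 per cent figures can be sketched briefly. The following is a minimal illustration, assuming only that "variance accounted for" means the squared correlation; the function name is hypothetical, and the .75 and .85 cutoffs are the rule-of-thumb criteria quoted above.

```python
def shared_variance(r):
    """Proportion of variance in one form accounted for by the other (r squared)."""
    return r ** 2

r_verbal = 0.77                      # Verbal IQ Scale correlation reported in the text
shared = shared_variance(r_verbal)   # 0.5929
unexplained = 1 - shared             # 0.4071

# Rule-of-thumb criteria quoted in the text.
meets_group_criterion = r_verbal >= 0.75
meets_individual_criterion = r_verbal >= 0.85

print(f"accounted for: {shared:.0%}, undetermined: {unexplained:.0%}")
print("group criterion met:", meets_group_criterion)
print("individual criterion met:", meets_individual_criterion)
```

A correlation that just clears the .75 criterion therefore still leaves more than 40 per cent of the variance unaccounted for.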

Actually, the above findings are not cause for dismay. The WAIS 
is by far the better instrument of the two, because of its better 
standardization, greater discriminative ability at the upper end of 
the IQ range, and improved scoring criteria. For these reasons, the 
WAIS should replace (and already has replaced) the WB. Since Wechsler
(1955) made a concerted effort to improve his test, it would have
been a bitter disappointment if the two forms had turned out to be
equivalent, since this would imply that the WAIS still retained
many of the faults of the WB.


Summary 


Since the WAIS has been replacing the WB in general use, it has 
become imperative to assess the form equivalence of the two scales. 
Fifty-one subjects were administered both tests in a counterbalanced 
order. Differences between the Performance and Full Scale IQ scores 
for the WAIS and WB were found. No statistically significant dif- 
ferences were discovered for the Verbal IQ Scales and the individual 
subtests. Correlation coefficients for the subtests and IQ Scales were 
all significant in a statistical manner except for the Object Assembly 
Subtest. However, it was felt that the correlations, although
significant, were too low to represent adequate form equivalence. The
significant differences between the Performance and Full Scale IQ's,
low regression coefficients, and the small amounts of variance
attributable to the correlation between the two scales seemed to
confirm the impression of low form equivalence. However, it was felt
that since the WAIS is an improved version of the WB it would
have been disappointing if the form equivalence had turned out
to be high.


REFERENCES


Cole, D. and Weleba, L. "Comparison Data on the WB and the
WAIS." Journal of Clinical Psychology, XII (1956), 198-199.

Goolishian, H. A. and Ramsay, Rose. "The WB Form I and the
WAIS: A Comparison." Journal of Clinical Psychology, XII
(1956), 147-151.

Grant, D. A. "The Statistical Analysis of a Frequent Experimental
Design." American Journal of Psychology, LXII (1949), 119-122.

Guertin, W. H., Rabin, A. I., Frank, G. H., and Ladd, C. E.
"Research with the Wechsler Intelligence Scales for Adults: 1955-
1960." Psychological Bulletin, LIX (1962), 1-26.

McNemar, Q. Psychological Statistics. New York: John Wiley and
Sons, 1949.

Neuringer, C. "A Statistical Comparison of the WB Intelligence
Scale, Form I, and the WAIS for a College Population."
Unpublished Master's thesis, University of Kansas, 1956.

Wechsler, D. The Measurement of Adult Intelligence (Third
Edition). Baltimore: Williams and Wilkins, 1944.

Wechsler, D. Manual for the WAIS. New York: Psychological
Corporation, 1955.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 4, 1963


VALIDITY STUDIES SECTION 


Edited by 


WILLIAM B. MICHAEL 
University of California, Santa Barbara 


Comparative Prediction of College Major-Field Grades by
Pure-Factor Aptitude, Interest, and Personality Measures.
JOHN W. FRENCH

Academic and Personality Differences between Women Students
Who Do Complete the Elementary Teaching Credential Program
and Those Who Do Not. GLENN W. DURFLINGER

The Validity of the Graduate Record Examinations as Used
with English-Speaking Foreign Students. NORMAN C. MABERLY

Use of a Biographical Information Blank in the Prediction of
Achievement in High School Science. JAMES M. RICHARDS,
JR., VICTOR B. CLINE, AND CLIFFORD ABE

Predictors of Scores on an Employment Counselor Selection
Battery. LORRAINE D. EYDE AND ROBERT S. WALDROP

Validation Studies of a Reading Prognosis Test for Children of
Lower and Middle Socio-Economic Status. MAX WEINER
AND SHIRLEY FELDMANN

Relationship of High School Curriculum Experiences to College
Grade Point Average. JOSEPH PAUL GIUSTI

Intellective and Non-Intellective Predictors of Success in
Nursing Training. WILLIAM B. MICHAEL, RUSSELL HANEY,
AND ARTHUR GERSHON

A Study of the Validity of the Programmer Aptitude Test.
THOMAS C. OLIVER AND WARREN K. WILLIS

The Comparative Effectiveness of Intellective and
Non-Intellective Measures in the Prediction of the Completion of a
Major in Theater Arts. JACK MORRISON

Comparison of the Validities of Selected Test Procedures to
Predict Shorthand Success. WALTER PAUK




COMPARATIVE PREDICTION OF COLLEGE MAJOR-FIELD 
GRADES BY PURE-FACTOR APTITUDE, INTEREST, 
AND PERSONALITY MEASURES 


JOHN W. FRENCH 
Educational Testing Service (Princeton) 


Problem 


This is a multi-variable validity study undertaken to find out how
useful pure-factor tests can be for the comparative prediction of
success in college fields of study. The decision was made to use
pure-factor tests in order to insure low intercorrelations among the
tests and, thus, to make possible the differential validities that are
necessary for differential prediction. The results of the study have
led to the development of a research multi-factor battery intended
to be useful eventually in academic counseling. Comparative
prediction is the term used when it is desirable to compare the
predicted level on a criterion in several fields: it represents a
desire for both the prediction of absolute levels and the prediction
of differences. In this article (1) the predictor variables will be
described; (2) their validities for success in several fields will be
reported; and (3) evaluations will be made of the absolute and
differential prediction obtained.


Description of Predictor Variables 


The predictor variables included 16 very short “pure” tests of 
aptitude factors, 14 interest measures, and 12 personality scales. 
Estimates of reliability (R) are given for variables 1-28. For vari- 
ables 4-13 and 16 these are alternate form reliabilities from an 
earlier study (French, 1951). For variables 1-3, 14, and 15 they
are Kuder-Richardson No. 21 reliabilities computed from medians
of the variable means and standard deviations that were found by
averaging across colleges. For variables 17-28 they are "Coefficient
Alpha" reliabilities computed from high-school student data.
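Because Kuder-Richardson No. 21 needs only the number of items, the test mean, and the standard deviation, it can be computed from summary figures such as those described above. The sketch below uses the standard textbook form of the formula; the function name and the example values (a 20-item test with mean 12 and standard deviation 3) are hypothetical and are not taken from the study.

```python
def kr21(n_items, mean, sd):
    """Kuder-Richardson formula 21 from test length, mean, and standard deviation.

    KR-21 = (k / (k - 1)) * (1 - M * (k - M) / (k * s^2))
    """
    if n_items < 2 or sd <= 0:
        raise ValueError("need at least two items and a positive standard deviation")
    variance = sd ** 2
    return (n_items / (n_items - 1)) * (1 - mean * (n_items - mean) / (n_items * variance))

# Hypothetical 20-item test with mean 12 and standard deviation 3.
example = kr21(20, 12.0, 3.0)
print(f"KR-21 = {example:.2f}")
```

The formula assumes items of equal difficulty, so it tends to give a conservative estimate when item difficulties vary.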

In the following list of aptitude factors and the tests used to 
measure them the numbers given after the description of each vari- 
able and before the reliability estimates represent the length of the 
test in items and in minutes, respectively. 


1. General Reasoning: arithmetic word-problems test. 20-10 R = .33
2. Associative Memory: Word-Number Test. 24-6 R = .78 
3. Integration: a very complex following-directions test. 10-15
R = .51
4. Visualization: Paper Folding Test. 20-6 R = .55 
5. Mechanical Knowledge: associating pictures of tools. 20-7
R = .67
6. Number Speed: speed in simple arithmetic operations. 60-2
R = .80
7. Space: a test like Thurstone's Areas. 40-3 R = .80
8. Speed of Judgment: speed in comparing names of personal
qualities. 100-2 R = .85
9. Fluency of Expression: completing similes three times each.
60-4 R = .72
10. Aiming: speed in drawing a continuous line between other lines.
1 min. R = .87
11. Motor Speed: speed in writing digits. 1 min. R = .83
12. Speed of Symbol Discrimination: cancelling A's on a page. 2
min. R = .92
13. Carefulness: accuracy on speed tests. R = .42
14. Meaningful Memory: recalling important words in sentences
studied earlier. 40-11 R = .79
15. Verbal Comprehension: a vocabulary test. 36-5 R = .64
16. Induction: selecting 4 of 5 sets of letters which follow some rule.
20-8 R = .55


The interest measures were scores based on 100 of the 200 items 
of the Cooperative Interest Index developed by the “8-year study” 
of the Progressive Education Association (Cooperative Test Divi- 
sion of the Educational Testing Service, 1950). The items consist of 
brief statements of activities of a kind familiar to high school (or
college) students. Responses are "like," "indifferent," or "dislike,"
and the scores are in terms of potential interest in college fields
which are not necessarily familiar to the students. Twenty minutes
was required for administration. Twelve scores, as follows, were
drawn from non-overlapping groups of items.


17. English R = .80               23. Music R = .83
18. Foreign Languages R = .86     24. Fine Arts R = .84
19. Mathematics R = .80           25. Industrial Arts R = .80
20. Social Studies R = .86        26. Business Courses R = .70
21. Biology R = .77               27. Home Economics R = .85
22. Physical Sciences R = .81     28. Sports R = .70


Two additional scores were derived from some of the same items: 
29. Manipulative Interests and 30. Reading Interests. 

The personality scales were adapted from (1) items found to be
relatively pure on personality factors in the literature (French,
1953), and (2) items from the incomplete Personality Research
Inventory (Saunders, 1955). Twenty-five minutes were required for
administration of the 96 items. The names of the 9-item scales
were as follows.


31. Surgency 35. Self-confidence 39. Personal Foresight 
32. Sociability 36. Persistence 40. Gregariousness 

33. Self-sufficiency 37. Dominance 41. Nervousness 

34. Tolerance 38. Artistic Tendency 42. Emotionality 


Samples. The test battery was administered to a total of 4,833
students as they entered as freshmen at four men's colleges, one
women's college, and three coeducational colleges. One of the
colleges was a college of forestry; the other seven had liberal arts
students. Three had engineering groups, and others included divisions
of business, agriculture, education, and pharmacy. In three of the
colleges a total of 20 freshman courses were studied, ranging in size
up to 540. A total of 59 major-field groups of varying size were
also studied.

Criterion measures. The criteria included grades in the freshman
courses, the junior and senior major-field grades, and a rating made
by the subjects a few weeks before graduation on satisfaction with
their choice of a major field. Estimates of the reliabilities of the
major-field grade criteria were made by correlating junior with
senior grades and correcting for half-length. The median of these
corrected reliabilities was .81.
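A correction "for half-length" is conventionally made with the Spearman-Brown formula, although the report does not name the formula it used. The sketch below assumes that formula; the function name and the junior-senior correlation of .68 are hypothetical values chosen only to show how a corrected figure near .81 could arise.

```python
def spearman_brown(r, factor=2.0):
    """Estimated reliability when a measure is lengthened by `factor`.

    With factor=2 this steps a half-length correlation up to full length.
    """
    return factor * r / (1 + (factor - 1) * r)

# Hypothetical correlation between junior-year and senior-year grades.
r_junior_senior = 0.68
corrected = spearman_brown(r_junior_senior)
print(f"corrected reliability = {corrected:.2f}")
```

The same function with `factor=3.0` gives the estimated reliability of a single test made three times as long; it is not by itself the adjustment of a validity coefficient.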

Analysis. Separately for each freshman course and major field at 
each college and separately by sex in some instances, tables of inter- 
correlations were computed using all predictor variables, grades in 
the course or field, and, in addition, for major-fields, the satisfac- 
tion rating. The validities for all predictors for each group were 
reported by French (1959). These validities were then averaged 
across colleges for each field. The resulting averages were reported 
by French (1961). Using averaged validities and averaged inter- 
correlations among predictors, multiple validities were computed for 
each field, and evaluations were made of the effectiveness of the 
data for differential prediction. The decision to use “pure” factor 
tests was considered to have been correct, because, while these short 
tests were found to have relatively low validity, differential predic- 
tion was made possible by a median intercorrelation among them of 
only .14.

Because even the averaged validities are very voluminous, the 
tables in this report present results for only 17 out of the 42 pre- 
dictors, four freshman grade criteria, ten major-field grade criteria, 
and no satisfaction criteria.

Although most of the presented results look better than the ones 


TABLE 1
Average Validities of Short Tests for Freshman Grades


Course:          English      Math       History     Biology
Sex:             M     F      M    F     M     F     M     F
No. of cases:    820   1314   458  87    121   411   262   204


Test Name
1. General Reasoning П Ву 27. aT. ла: 22 S 
3. Integration 19 .16 .14 16 O и 13 26 
4. Visualization 492 07 10 16 -1i1 05 22 18 
6. Number Speed 4 o4 26 08 2 12 —09 10 
14. Meaningful Memory .09 .18 .06 .17 .08 .13 .21 .18
15. Verbal Comprehen- 
sion. 40. 35 д7 ло 38 .29 29 5 
16. Induction dE nO a ла 107 35. 3 
17. English Interest 22 15 —01 —05 13 17 .16 —08 
19. Math. Interest 97 —08 11 235 —14 .04 13 48 
20. Soc. Stud. Interest — .10 08 04 (08 15 26 .00 =1 
21. Biology Interest -.01 —02 00 21 13 00 36 46 
36. Persistence 07 11 .07 -06 24 ов 056 Ж 


40. Gregariousness —:03 —16 01 —01 —20 —08 —.13 -% 


TABLE 2
Average Validities of Short Tests for Major-Field Grades and
Multiple Validity Coefficients*

Major fields (columns, left to right): English, Engineering, History,
Government, Biological Science, Agriculture, Physical Science,
Architecture, Business, Home Economics


Tests    Cases: 140 254 140 90 145 267 63 91 117 128


1.R 32 26 13 22, 47 Ех 
3. In 36 27 15 16  .04 7700 0000 007 ST 
4. Vi 02 07 03 .08 06 .04 .14 12 —04 25 
6. N 10 45 25 —.08 30 E EL 
14. Mm 14 14 04 —01 is ОБО ООЗ 
15. V 49 .24 .37 .25 .02 17 36 01 A 
16. I 18 115  .20 22 „02 | 7 O ene 
17. Engl з .02 24 04  dó OB ce 
19. Math 06 .06 —.10 —.13 —.02 .09 16 .19 —.04 04 
20. S. S 18 07 2141—01 19 АООТ 
21. Biol 08 .05 —.03 .03 .06 .12 .01 —10 .02 —.04 
22. P.S 21 106 —.04 —05'—0ї 02060030 DERE 
24. Art 27 06 02 .14 08 .03 —.02 .38 —.02 —.03 
26. Bus —.16 —.06 —.14 —.16 01 .02 .08 —06  .23 01 
27. H. E 09  .11 —.07 —.08 .02 —.04 —20  .10 .06 —.06 
36. Per 11. .08 24 —02. .14 07 Л ОИ 
40. Gre —326 —.14 —.35 —.10 —.08 —.02 —.21 —23 102 —.04 


Mult. Vale %  .37 52 33 23 29 М 56 41 4 
* The validities of the predictors that were used in computing the
multiple correlations are printed in bold face. That for Physical
Science also used Aiming (No. 10), and that for Architecture also
used Mechanical Knowledge (No. 5).


not presented, the size of validity coefficients was not a primary
factor in deciding what to include in Tables 1 and 2. The seven
aptitude tests presented are logically better suited to academic
prediction than the nine omitted, less superficial, and usually less
speeded. The interest measures are strictly the ones most
appropriate by name to the major fields. In general, the personality
variables were not highly predictive of the criterion; the two that
appear in the tables were selected as the ones with the highest
(absolute) validities. The criterion fields were selected purely on
the basis of number of cases. The satisfaction criterion had
correlations with the aptitude tests and personality measures that
were low and very inconsistent from college to college. Validities of
the interest scores, on the other hand, for the satisfaction criteria
were more consistent and much like those to be reported for
major-field grades, although they ran somewhat lower.

Validity data. Table 1 gives the validities for freshman grades.
Table 2 gives the validities for major-field grades. The coefficients
are relatively low because of the extreme shortness of the tests and
because of restriction in the range of ability within the colleges and
within the major fields. However, reasonable patterns are apparent.
The important conclusion for counseling and guidance purposes is
that, in both Table 1 and Table 2, the interest measures have lower
validities than the aptitude tests on an absolute scale, but they are
more differentially valid. They have relatively high validities for
appropriate fields and near zero or negative validities for
inappropriate fields.

Evaluation of the results for absolute and differential prediction.
Multiple correlation coefficients between appropriate sets of four of
the 17 predictors (not necessarily the four that would produce the
highest coefficient) and grades in the ten major fields are given at
the bottom of Table 2. Some specific figures and a consideration of
statistical corrections or adjustments are of interest. The multiple
validities for Physical Science, History, and English were .64, .52,
and .62, respectively. When adjusted for restriction in range for the
group enrolled in the major field as compared to the whole class and
when estimated for tests of reasonable length (three times as long
as the tests actually used), these multiple validities become .82, .58,
and .76. Adjusted for "shrinkage," these figures are .75, .50, and .72.

When unadjusted correlations were used, the estimated
intercorrelations among predicted criterion scores had a median of .48.
These correlations should be as low as possible for differential
prediction. A desirable lower value might have resulted from more
highly differential tests than these and from criteria that were
more nearly independent. In spite of relatively high estimated
correlations between predicted criteria, the estimated validity of
predicted differences between the criteria had a median value of .46.
This is the estimated correlation between the differences that could
be predicted between pairs of criterion scores for individuals and the
actual differences that would have been observed for such
individuals. It can properly be called the validity of the differential
prediction (Mollenkopf, 1952). A full report of these computations and
the formulas that were used was presented by French (1961).

Adjustments for restriction of range and for the short length of
the tests were applied to some of the original validities. The
estimated correlations among predicted criteria and the validities of
differential prediction were recomputed. The results for three pairs
of fields are as follows.


                              Estimated Correlations       Validity of
                              Between Predicted Criteria   Differential Predictions

                              Not Adjusted   Adjusted      Not Adjusted   Adjusted
Physical Science and History      .43          .44             .58          .71
Physical Science and English      .43          .48             .63          .75
History and English               .62          .64             .47          .55

These figures indicate the reasonable conclusion that differential
prediction is promising between unlike fields such as Physical
Science and History, but much less favorable between History and
English, which are more alike in that both of them require
predominantly verbal abilities. In these computations, generally, it
is the aptitude tests that contribute most to absolute prediction and
the interest measures that contribute most to differential prediction.
Both of them, therefore, have their place in comparative prediction.
With the observed and estimated values given here, it is possible to
develop expectancy tables for use by students and by their
counselors. When entered with a set of test scores, such tables would
give the percentage of students with similar scores who attain various
levels of success in each field and the percentage of students with
similar scores who are more successful in one or the other of any
pair of fields.
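One simple way to build such an expectancy table from raw records is sketched below. It is an illustration only: the function name, the score bands, the success level, and the eight student records are all hypothetical, and the report itself proposes deriving the tables from the observed and estimated correlations rather than from a direct tally of this kind.

```python
from collections import defaultdict

def expectancy_table(records, band_edges, success_level):
    """Percentage of students in each score band whose grade reaches success_level.

    records: iterable of (predictor_score, criterion_grade) pairs.
    band_edges: ascending cut scores; band 0 lies below the first edge.
    """
    counts = defaultdict(lambda: [0, 0])  # band -> [number successful, number in band]
    for score, grade in records:
        band = sum(score >= edge for edge in band_edges)
        counts[band][1] += 1
        if grade >= success_level:
            counts[band][0] += 1
    return {band: 100.0 * ok / n for band, (ok, n) in sorted(counts.items())}

# Hypothetical (composite score, major-field grade average) records for one field.
students = [(35, 1.8), (42, 2.1), (48, 2.6), (55, 2.4),
            (58, 3.0), (63, 3.2), (67, 2.9), (72, 3.6)]
table = expectancy_table(students, band_edges=[50, 65], success_level=2.5)
for band, pct in table.items():
    print(f"score band {band}: {pct:.0f}% reach a 2.5 average or better")
```

Running the same tally for each field, and for the difference between two fields, would give the two kinds of percentage described above.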


REFERENCES


French, J. W. "The West Point Tryout of the Guidance Battery."
Research Bulletin 51-12. Princeton, N. J.: Educational Testing
Service, 1951.

French, J. W. The Description of Personality Measurements in
Terms of Rotated Factors. Princeton, N. J.: Educational Testing
Service, 1953.

French, J. W. "Comparative Prediction of Success and Satisfaction
in College Major Fields. Part I: The Study ... Results."
Research Bulletin 59-10. Princeton, N. J.: Educational
Testing Service, 1959.

French, J. W. "Comparative Prediction of Success and Satisfaction
in College Major Fields. Part II: Pooling ... and Conclusions."
Research Bulletin 61-7. Princeton, N. J.: Educational
Testing Service, 1961.

Mollenkopf, W. G. "Some Aspects of the Problem of Differential
Prediction." EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT,
XII (1952).

Saunders, D. R. Personality Research Inventory. Princeton, N. J.:
Educational Testing Service, 1955.




ACADEMIC AND PERSONALITY DIFFERENCES BETWEEN 
WOMEN STUDENTS WHO DO COMPLETE THE 
ELEMENTARY TEACHING CREDENTIAL PROGRAM AND 
THOSE WHO DO NOT 


GLENN W. DURFLINGER 
University of California, Santa Barbara 


Introduction 


Colleges and universities of all types and programs throughout 
the nation appear to be moving in the direction of institutional se- 
lection from among the students who seek entrance into the profes- 
sional curriculums. In the fields of medicine, law, pharmacy, nurs- 
ing, and engineering particularly, this selective process has been 
going on for considerable time. For a variety of reasons the selection
of students has been much slower in the teacher preparation
departments, institutions, and schools. Nevertheless, there is a
continuing series of researches on various problems related to the
validity of tests, of academic grade point averages, and of
interviews, for example, which are or could be employed in the
recruitment and selection of the best candidates for the profession
of teaching and in the rejection of the other candidates.


Background of This Study 
The study reported here is part of a more extensive investigation
in the realm of teacher-candidate selection and in the prediction
of success in teaching. The complete study involves the utilization
of nine tests or inventories administered to a total of 464 freshman
and sophomore students at the University of California, Santa
Barbara. An interview, a questionnaire, and an autobiography were
also employed with a portion of these students. There is a total of
more than 150 subtests, scores, and data items being validated.

In the criteria area there was developed locally a teacher rating
scale with seven sub-sections to be applied to all students during
student-teaching and during their service in a teaching position after
graduation. Also, student-teaching grades, grades in procedures
(methods) courses and in other professional courses, and total grade
point averages have been or are to be utilized as criteria.


Purpose of Current Study 


This specific study involves a comparison on certain personality 
and intellectual traits of four groups of the initial 464 freshman and 
sophomore students. Women students at the University in the fall 
semester 1959, spring semester 1960, and fall semester 1960, who 
for at least one year indicated an interest in teaching in grades 
kindergarten through eight, were included in the study. By the
spring semester 1963 they had divided themselves into four groups as
follows.


A. Those who pursued a teacher preparation program to student- 
teaching in the senior year. 

B. Those who selected this program, but were unable to main- 
tain sufficiently high grades to remain in the University. 

C. Those who selected this program but at their own volition 
transferred to another major and remained in the University 
to the senior year. 

D. Those who for reasons other than academic grades withdrew
from the University within three years. (The University has
no record that they ever pursued college or university work
elsewhere.)


The reasons that only female students are included in this study
are: (1) only a limited number of men were available;
(2) in the Strong Vocational Interest Blank the male and female
scales are best considered separately, since there is a limited
amount of overlapping; and (3) there is the possibility of confusion
and lack of refinement in the results and generalizations on
personality traits, when the two sexes are combined.


Tests and Inventories Employed 


Students in these four defined groups were administered the
Iowa Tests of Educational Development (ITED), American Council
on Education (ACE) Psychological Examination, Minnesota
Teacher Attitude Inventory (MTAI), California Psychological
Inventory (CPI), Strong Vocational Interest Blank (SVIB), and the
Heston Personal Adjustment Inventory (HPAI).

The measures of educational achievement employed in the study
included the first seven tests of the ITED, which are: Understanding
of Basic Social Concepts, General Background in the Natural
Sciences, Correctness and Appropriateness of Expression, Ability to
Do Quantitative Thinking, and Interpretation in the three areas of
Social Studies, Natural Sciences, and Literary Materials.

The ACE Psychological Examination is a measure of intellectual
power as required by the academic curricula of most colleges and
universities. It consists of two parts, quantitative and linguistic,
which may be combined into a total score with greater weight being
given to the linguistic fraction.

The MTAI is designed to measure attitudes toward teaching via
five aspects of education and child development. These are: (1)
moral status of children; (2) discipline and problems of conduct;
(3) principles of child development and behavior; (4) principles of
education; and (5) personal reactions of the teacher. In this study
the total score was used, regardless of the relative size of the five
aspects.

The following 18 characteristics were assessed by the CPI according
to the four classifications given.

Class I. Measures of Poise, Ascendancy, and Self-assurance. (1)
Dominance (Do); (2) Capacity for status (Cs); (3) Sociability (Sy);
(4) Social presence (Sp); (5) Self-acceptance (Sa); (6) Sense of
well-being (Wb).

TABLE 1
Direction and Significance of Mean Differences

Class II. Measures of Socialization, Maturity, and Responsibility.
(7) Responsibility (Re); (8) Socialization (So); (9) Self-control
(Sc); (10) Tolerance (To); (11) Good impression (Gi); (12)
Communality (Cm).

Class III. Measures of Achievement Potential and Intellectual
Efficiency. (13) Achievement via conformance (Ac); (14) Achievement
via independence (Ai); (15) Intellectual efficiency (Ie).

Class IV. Measures of Intellectual and Interest Modes. (16)
Psychological-mindedness (Py); (17) Flexibility (Fx); (18)
Femininity (Fe).

The Heston Personal Adjustment Inventory has been used in
similar studies related to the prediction of success in teaching. It
has sub-tests which purport to measure the following six aspects of
adjustment: A (Analytical thinking), S (Sociability), E (Emotional
stability), C (Confidence), P (Personal relations), and H (Home
satisfaction).


Statistical Procedure 


The means and standard deviations were found for each of the
66 variables for all four groups of students. The null hypothesis was
that the mean of the population from which one sample was taken
was less than (or greater than) or at most (least) equal to the mean
of the population from which the second sample arose. Second,
differences between means of appropriate pairs of groups were tested
for statistical significance. In Table 1 the directions of differences
between the mean of Group A and the means of each of Groups B, C,
and D in turn are indicated; that is, in the first column the mean of
Group A is compared with the mean of Group B for each variable.
If the mean of Group A is larger a minus sign is recorded; if the
mean of Group B is larger a plus sign is recorded. The mean of
Group A is similarly compared in the second and third columns with
the means of Groups C and D.

The means and standard deviations in raw score form were
computed, and through use of one-tailed tests the significance of the
differences between the means was indicated by a plus or minus in
each column. If a difference was significant at the .05 or .01 level,
it is so indicated at the appropriate place in the Table (relative to a
one-sided significance test). In all other cases the difference, if any,
between the means was not significant at or beyond the .05 level.
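The recording rule described above can be sketched as follows. This is a rough illustration rather than the author's computation: it assumes a large-sample normal (z) approximation with unpooled variances, whereas the article does not say which test statistic was used, and a t test would be more appropriate for small groups. The function name and the scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def table_entry(group_a, other_group):
    """Return (sign, level) for one cell, following the article's convention.

    sign:  "-" if Group A's mean is larger, "+" if the other group's is larger.
    level: ".01", ".05", or "" from a one-tailed test in the observed direction.
    """
    diff = mean(group_a) - mean(other_group)
    std_error = sqrt(stdev(group_a) ** 2 / len(group_a)
                     + stdev(other_group) ** 2 / len(other_group))
    z = abs(diff) / std_error
    p_one_tailed = 1 - NormalDist().cdf(z)  # normal approximation (an assumption)
    sign = "-" if diff > 0 else "+" if diff < 0 else "0"
    level = ".01" if p_one_tailed < 0.01 else ".05" if p_one_tailed < 0.05 else ""
    return sign, level

# Hypothetical raw scores on one variable for Group A and Group D.
group_a = [52, 55, 58, 60, 61, 63, 65, 66]
group_d = [45, 47, 50, 51, 53, 54, 56, 57]
entry = table_entry(group_a, group_d)
print(entry)
```

Cells with an empty level correspond to the "not significant at or beyond the .05 level" case in the text.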


Interpretation of Results

From the first column of Table 1, in which Groups A and B are
compared, it is difficult to see from the ITED tests why the students
of Group B are the academic casualties. On four of the tests Group
B revealed averages below Group A, and on three tests Group B
placed above Group A. None of the differences was significant.


Discussion

In comparing Group C with Group A on the ITED tests, the
students who changed major exceeded the students in Group A in five
of the seven tests. One difference favoring the A group and one
favoring Group C was significant at the .05 level.

Group A in comparison with Group D showed rather consistently
significant superiority. All of the A Group means were above the D
Group means; three were significant at the .01 level, and two at the
.05 level. These students of Group D, by virtue of their achievement
averages being the lowest of all four groups, were the real academic
casualties. While the University had no record that they continued
on in higher education, it is doubtful that they did. Their academic
background was such that it did not encourage continuance,
although strangely enough, it was not their low university grades that
forced their permanent withdrawal.

In the ACE Psychological Examination the A Group, those be- 
coming elementary teachers, were slightly but not significantly in- 
ferior in mean score to the C Group. They were significantly in- 
ferior in Q, L, and Total mean scores to the students changing 
major. They were slightly inferior in the L factors, but superior in 
the Q factors to the students of Group D who withdrew. 

In the MTAI test the students in the A Group averaged above the
students of all three groups B, C, and D. The mean differences in 
which Group A was superior were significant at the .01 level relative 
to Groups C and D. It would be logical to conclude that this test 
shows promise of identifying the women students who would be 
likely to complete the elementary credential program with success 
once they had begun. 

A glance at the B, C, and D columns under the CPI results will
reveal the excess of minus signs over plus signs. The
ratio is 36 to 16. This fact tends to indicate the existence of the
positive and desirable personality factors within the A Group to a
greater extent than in the other three groups. For example, the
first scale, Dominance, is designed to assess factors of leadership
ability, dominance, persistence, and social initiative. The
elementary teacher group tended to excel in these traits,
although the superiority was not statistically significant except in
relation to Group D. In the total inventory few of the differences
between Group A and each of the other three groups were
statistically significant at either of the levels chosen.

It is interesting to note that Group A on the Flexibility subtest
tended to be less flexible, i.e., more deliberate, cautious, guarded,
methodical, mannerly, and rigid than the B, C, or D Groups on the
average.

In the femininity scale of the CPI there is observable the same 
trend found in a similar scale of the SVIB—namely that Group A 
ranked below the other three groups in femininity of interests. This 
was significant at the .05 level with Group B and at the .01 level 
with Group D. 

In the SVIB for women in most vocational preferences there was
generally only a slight difference between the groups studied. There
was one cluster of occupations, however, from which Group D
tended to steer away as compared with Group A. Significant at the .01
level was Group D's comparative lack of interest in the vocations
of the Social Worker, Psychologist, Lawyer, Social Studies Teacher,
and YWCA Secretary.

Group A indicated a stronger preference for the vocation of
Housewife than did any of the other three groups of women stu-
dents. In comparison with Group C the difference was significant at
the .05 level.

There was another cluster of vocational interests in which the
students of Group A significantly excelled Group C, although they
did not excel the other two groups. These interests pertained to the
vocations of the Home Economics Teacher, Dietitian, Physical Ed-
ucation Teacher, Nurse, and Dentist.

Through the remainder of the SVIB Scale the differences were 
generally slight until that of Musician is reached. Here the elemen- 
tary major women significantly excelled the academic failing group 
and the drop-outs. However, the change-of-major group, C, slightly
excelled Group A in their preference for this vocation. 


In the Femininity-Masculinity scale the women students in the cre- 
dential program quite consistently ranked in femininity below the 
other three groups. This difference was significant at the .05 level rela- 
tive to Group C. An analysis of the SVIB individual profiles for some 
of the women ranking highest academically revealed a general ten- 
dency toward the center of the Femininity-Masculinity continuum. 

As measured by the HPAI Group A excelled the three other 
groups in the first five traits except in the instance of Group D rela- 
tive to the scales of Sociability and Personal Relations. It may well 
be a significant finding with this instrument that, except for the 
Home Satisfaction scale, all but two of the mean differences indi- 
cated that the A Group was superior. 

In the subtest Home Satisfaction in the HPAI, Group A was sur- 
passed by the other three groups. The differences throughout the 
subtests which were small failed to approach statistical significance 
with one exception. In Sociability, Group A was significantly higher 
than Group B (at the .05 level). 


Conclusions 


1. The women students earning elementary teaching credentials 
appeared to excel, although seldom at a level of statistical sig-
nificance, the students in the other three groups in what would be
judged as desirable personality traits. 

2. In the measures of academic aptitude the women who were 
planning to be teachers tended to rank below the students who 
failed to achieve sufficiently high grades to remain in college and 
those changing to another major. In the case of the latter group the 
differences were significant at the .05 level. 

3. Women students who had earned an academic grade point 
average high enough to continue in the University but who failed to 
do so did reflect a lower level of educational background on the 
ITED tests than did the elementary credential women. 

4. Elementary credential women did tend to indicate greater 
interest in the vocational fields of Psychologist, High School Social 
Studies Teacher, YWCA Secretary, Housewife, Home Economics 
Teacher, and Physical Education Teacher than did the women in 
the other three groups. Furthermore, they were consistently less 
interested in only one field, i.e., Buyer, than were women in other 
groups. 


5. The MTAI and parts of the other five tests and inventories 
offer promise as instruments for institutional screening and selection 
of women candidates for an elementary teacher preparation pro- 
gram. 

6. A masculinity-femininity scale would be of assistance in iden- 
tifying the characteristics of groups of women seeking an elementary 
teaching credential. On the average they tend away from high 
femininity of interest scores. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 4, 1963


THE VALIDITY OF THE GRADUATE RECORD 
EXAMINATIONS AS USED WITH 
ENGLISH-SPEAKING FOREIGN STUDENTS 


NORMAN C. MABERLY 
Harcourt, Brace, & World, Inc. 


The Problem 


The basic problem underlying this investigation was to obtain
empirical evidence of validity that would justify or preclude the 
use of any part of the Graduate Record Examinations for either 
total or contributory evaluations of the academic potential of the 
large numbers of English-speaking foreign students seeking entrance 
to the educational programs of American graduate schools. 


The Instrument 


The predictor instrument consisted of the GRE Aptitude Test 
yielding separate verbal (V) and quantitative (Q) scores, and the 
GRE Area Tests which measure knowledge and understandings in 
three broad areas of social sciences (SS), humanities (H), and 
natural science (NS). 


The Validation Sample 


The sample groups represented as nearly as possible the charac-
teristics of the total English-speaking foreign student population in
the United States in 1961-62. These characteristics included lan-
guage and cultural backgrounds, geographic origins, curricular
choices, types of colleges attended, sex, and period of college at-
tendance.

Two groups, totalling 252 male students, were randomly selected
from all English-speaking foreign graduate students who had been
enrolled at the University of Southern California between the years
1955 and 1961, and who had available GRE scores and criterion
measures.

Group I consisted of 104 students from countries where English is
the primary language and where the culture follows a pattern simi-
lar to that found in most parts of Western civilization. Group II
consisted of 148 students from countries where English is not the
first language of the population but is the language of secondary and
higher educational instruction, and where the culture does not fol-
low the usual Western pattern.


TABLE 1
Measures of Central Tendency, Standard Deviations, and Standard Errors
of the Mean for One Criterion Variable and Five Tests of the
Graduate Record Examinations

                       GPA       V      Q      SS     H      NS
Mean                  3.200*    528    553    490    508    518
                      2.900     357    476    338    373    435
Median                3.300     514    556    487    518    528
                      2.950     349    487    335    363    436
Standard Deviation     .514     102    117     99     91    113
                       .504     [remaining entries illegible]
SE mean                .050      10     12     10      9     11
                       .042       6      9      6      6    [illegible]

* Upper row statistics are for Group I (N = 104).
Lower row statistics are for Group II (N = 148).


TABLE 2
Bivariate Correlations

Variables               V       Q       SS            H       NS
Grade Point Average    .13     .10     [illegible]   .16     .02
                       .23*    .10     .10           .18     .08

* Significant at .01.


TABLE 3
Multiple Correlational Data

                                 Grade-Point Average
Variables                     R       R²      k       E
Aptitude Test (V + Q)        .135    .018    .990    .010
                             .288    .054    .973    .027
Area Tests (SS + H + NS)     .176    .031    .984    .016
                             .150    .023    .989    .011
Total Battery                .194    .038    .982    .018
                             .238    .057    .971    .029
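
The columns of Table 3 appear to be tied together by the usual textbook relations, with k read as the coefficient of alienation and E as the index of forecasting efficiency. That reading of the k and E headings is an assumption, since the article does not define them; the sketch below (the function name is hypothetical) simply recomputes the last three columns from R for three of the tabled rows.

```python
import math


def alienation_and_efficiency(r):
    """Return (R squared, k, E) for a multiple correlation R.

    Assumed definitions (not stated in the article):
      k = sqrt(1 - R**2)   coefficient of alienation
      E = 1 - k            index of forecasting efficiency
    """
    r2 = r * r
    k = math.sqrt(1.0 - r2)
    return r2, k, 1.0 - k


# (R, R squared, k, E) as tabled for three rows of Table 3.
rows = [
    (0.135, 0.018, 0.990, 0.010),  # Aptitude Test, Group I
    (0.176, 0.031, 0.984, 0.016),  # Area Tests, Group I
    (0.238, 0.057, 0.971, 0.029),  # Total Battery, Group II
]

for r, r2_tab, k_tab, e_tab in rows:
    r2, k, e = alienation_and_efficiency(r)
    print(f"R={r:.3f}  R2={r2:.3f} (tabled {r2_tab:.3f})  "
          f"k={k:.3f} (tabled {k_tab:.3f})  E={e:.3f} (tabled {e_tab:.3f})")
```

Under these assumed definitions the recomputed values agree with the tabled ones to within about .001, which is consistent with rounding.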


The Criterion 


The criterion was grade-point average earned over a minimum of 
two full-time semesters of graduate study totalling at least twelve 
units. The weights assigned to the letter grades on the student 
records were as follows: A—four points per unit; B—three points; 
C—two points; D—one point; F—no points. For each student, the 
total number of units attempted was divided into the number of 
points after all no-credit or temporary IN marks had been excluded. 
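
The grade-point computation described above can be sketched as follows. This is an illustration only: the function name, the `(grade, units)` record layout, and the treatment of every mark other than A through F as excluded are assumptions, and the transcript is invented.

```python
# Weights per unit as stated in the text: A=4, B=3, C=2, D=1, F=0.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def grade_point_average(records):
    """Return total grade points divided by total units attempted.

    `records` is a list of (letter_grade, units) pairs (hypothetical layout).
    Marks outside A-F, such as no-credit or temporary IN marks, are skipped.
    """
    graded = [(grade, units) for grade, units in records if grade in GRADE_POINTS]
    total_units = sum(units for _, units in graded)
    if total_units == 0:
        raise ValueError("no graded units")
    total_points = sum(GRADE_POINTS[grade] * units for grade, units in graded)
    return total_points / total_units


# Invented transcript: the IN mark is excluded before dividing.
transcript = [("A", 3), ("B", 3), ("IN", 3), ("C", 2), ("A", 4)]
gpa = grade_point_average(transcript)
print(round(gpa, 3))  # 41 points over 12 units
```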


Findings 


Tables 1, 2, and 3 present mean performance statistics, bivariate
correlations, and multiple correlations, respectively, for Group I and
Group II. Table 4 shows Discriminant Function data based on the
effectiveness of each GRE test or battery in separating students in
the upper and lower 27 per cent of the GPA distributions. In each
table there are two rows of data for each entry: upper row statistics
are for Group I, lower row statistics are for Group II.


TABLE 4
Discriminant Function Analysis

Discriminant Function Values¹                              [illegible]²

.773 Verbal (V)⁴                                            .16
.130 Verbal (V)                                             .39*
.831 Quantitative (Q)                                       .16
.481 Quantitative (Q)                                       .15
.854 Social Science (SS)                                    [illegible]
.597 Social Science (SS)                                    [illegible]
.135 Humanities (H)                                         [illegible]
.135 Humanities (H)                                         [illegible]
.158 Natural Science (NS)                                   .16
.610 Natural Science (NS)                                   .10

Aptitude Battery Composite³
.407V + .675Q                                               [illegible]
.112V - .318Q                                               [illegible]

Area Battery Composite
.514SS + .131H - .478NS                                     [illegible]
.131SS - .644H - .605NS                                     [illegible]

Total Battery Composite
.354V + .116Q + .271SS + .121H - .113NS                     [illegible]
.10V - .303Q - .577SS - .244H - .378NS                      [illegible]

¹ When each individual variable is weighted by the given value, the [illegible]
the extreme groups is maximized.
² [illegible], or effectiveness of each weighted variable or battery in separating discrete
groups at the upper and lower 27 per cent levels on the GPA.
³ Any number of weighted variables may be added or [illegible]; the combinations shown in the
table provide the weighted composites used for maximum discrimination of the
English-speaking foreign student groups.
⁴ Upper row statistics show weights and coefficients for Group I, lower row
weights and coefficients for Group II.
* Significant at .01.
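
The extreme groups used for Table 4 can be formed as in the sketch below, which selects the upper and lower 27 per cent of a GPA distribution. The function name, the rounding of the tail size to the nearest whole student, and the handling of ties at the cut are assumptions; the article does not say how these details were settled, and the GPA values are invented.

```python
def extreme_groups(gpas, fraction=0.27):
    """Return (lower, upper) lists of student indices.

    `lower` holds the indices of the lowest `fraction` of GPAs and `upper`
    the highest. Tail size is rounded to the nearest integer; ties at the
    cut are resolved by original order, which is an arbitrary choice.
    """
    n_tail = round(len(gpas) * fraction)
    order = sorted(range(len(gpas)), key=lambda i: gpas[i])
    lower = order[:n_tail]
    upper = order[len(order) - n_tail:]
    return lower, upper


# Invented GPAs for ten students.
gpas = [3.2, 2.1, 3.9, 2.8, 3.5, 1.9, 3.0, 2.5, 3.7, 2.9]
lower, upper = extreme_groups(gpas)
print(lower, upper)

# Tail sizes this rounding rule would give for groups of 104 and 148.
tail_sizes = [round(n * 0.27) for n in (104, 148)]
print(tail_sizes)
```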


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 4, 1963


USE OF A BIOGRAPHICAL INFORMATION BLANK IN THE 
PREDICTION OF ACHIEVEMENT IN 
HIGH SCHOOL SCIENCE¹


JAMES M. RICHARDS, JR. 
Educational Testing Service 


VICTOR B. CLINE AND CLIFFORD ABE
University of Utah 


Because of the more and more difficult problems facing our society
(e.g., international tensions and the population explosion), it is
becoming increasingly important to discover persons with high po-
tential for science at as early an age as possible. Since performance
in science has been shown to be quite complex (Taylor et al., 1963),
it is also quite important to explore a variety of techniques for
identifying scientific talent. One such technique, which has shown
considerable promise with adult scientists, is the biographical informa-
tion blank. Using a variety of criteria of the creativity and produc-
tivity of scientists, Ellison and Taylor (1962) have obtained original
validities for empirically derived biographical blank keys ranging
from the 70's to the 90's, and cross-validities ranging from the 40's
to the 60's. The purpose of the present research was to determine
whether or not similar relationships exist between a biographical
information blank and four indices of achievement in high school
science. The basic procedure was to determine which specific bio-
graphical items correlated with each of these indices, and from this
information to develop empirical keys for predicting each index or
criterion. Since there is no doubt that such a procedure produces
very high correlations if the key is “validated” on the same group
on which that key was built (Cureton, 1950), the validity of these
keys was evaluated in a double cross-validation design. Two anal-
yses of these criteria were made, one of achievement without regard
to “ability” and one of achievement with “ability” controlled.

¹ This research was supported by a contract between the University of Utah
and the Cooperative Research Branch, U. S. Office of Education.


Procedure 


The Biographical Information Blank. The primary reason for the 
success of biographical information blanks as predictive devices 
appears to be that such blanks sample very broadly from diverse 
realms of behavior and attitudes. Thus, a chief distinction of the 
biographical information blank is that it tends to minimize the 
expression of general response sets. Such a broad sampling also tends 
to minimize the influence of incorrect a priori hypotheses about what 
kinds of biographical information should be related to achievement. 
Accordingly, the goal in constructing the blank for this study was to 
sample quite diversely. The result, of course, is an extremely hetero- 
geneous group of items. 

The Biographical Information Blank used in the present research
was composed of 300 multiple choice items.² The items basically fell
into four broad categories: (1) Demographic Information, (2) Early
Life Experiences, (3) School Accomplishments, and (4) Current
Attitudes and Achievements.

“Demographic Information” consists of highly objective back-
ground data. Questions such as “How old are you?”, “How many
odd jobs, part time or full time, have you held?”, and “How old
was your mother when you were born?” would be included. Items
such as these remain relatively free of subjective attitudes and
could be verified by independent sources if necessary.

“Early Life Experience” items attempt to reconstruct the child's
perception of his relationships with his parents, siblings, and peers
as well as their relationships with one another. “To whom did you
feel closest during your childhood?”, “How much disagreement, dis-
cord, or friction have you had with your mother?”, and “Which of
your parents is more to blame for the disagreements between them?”
are examples.

“When you prepared for an exam, how often did you ‘cram’ in-
stead of using other study methods?” and “About what percentage
of students would you have surpassed if you had done the very best
you could during your high school career?” are representative of the
“School Accomplishments” category.

² The authors wish to express appreciation to Dr. Calvin W. Taylor and Mr.
Robert Ellison for permitting use of their biographical inventory as an aid in the
development of the form used in this study.

The “Current Attitudes and Achievements” section includes mis-
cellaneous items relating to future plans, ideals, values, and general
habits. Questions ranging from “How often do you generally watch
television at home?” to listing Money, People, Ideas, and Things in
order of their importance to the individual are representative of the
range of items which are included.

The questions were designed so as to be applicable to both male
and female high school students. Items with similar content were
randomly distributed throughout the questionnaire in order to prevent
the formation of an irrelevant response set, thus allowing each
item to be responded to independently.

Subjects. The sample consisted of 543 students at two high schools 
in a suburban Salt Lake City, Utah school district.³ These students
were selected on the basis of having completed at least two science 
courses beyond general science, and were seniors at the time of the 
data collection. Of this sample, 331 students were tested in the 
winter and spring of 1961, and 212 students in the spring of 1962. 
There were 285 males and 258 females in this sample. 

In obtaining these groups, a graduate assistant went to the 
permanent school records and obtained the names of all students 
who had completed the required number of courses. Arrangements 
were then made through the school counselors to test these students 
in groups. All testing was done during regular school hours, and 
subjects were told that they were participating in a research project.

In the data analysis, the sexes were treated separately, as pilot
studies on small portions of these data indicated a considerable
number of sex differences in the biographical correlates of science
achievement.

In the data collection, all subjects completed the measuring instru-
ments in the same order, first the biographical information blank,
and second a special questionnaire dealing with their particular
interests in science. Other information was obtained from the
permanent school records and from teachers at a later time.

Indices of Achievement. Four criteria of achievement in high
school science were used. The first of these was grade point average
in all science courses adjusted so that an A = 4.00, B = 3.00, C =
2.00, and D = 1.00. Variation in the number of science courses taken
was treated as error in the sense that no adjustments were made to
correct for it. The second criterion was the score obtained at the
end of the senior year on the Sequential Test of Educational
Progress: Science Achievement Test. This score, in the form of a
percentile rank, was obtained from permanent school records. The
third criterion was a teacher rating of “overall performance as com-
pared to a hundred randomly selected science students.” These
scores were adjusted so that 100 was the highest possible score, and
the score for each individual student was the average of the ratings
given him by all instructors who had taught him in a science
course. The fourth criterion was obtained from a modification of a
special form, which was originally developed by the National Merit
Scholarship Corporation.⁴ On this form Ss were required to respond
to the following three questions.

a. What are your major interests in school?

b. In what fields in the area of science do you have the greatest
interest?

c. What scientific problems do you consider to be important and
of special interest to you? Do you have any notions about how these
problems might be solved?

³ The authors are grateful to Dr. Kenneth C. Farrer of Granite School Dis-
trict as well as personnel in the Salt Lake City Schools for making this sample
available.

The original intention of the authors was to make detailed break-
downs of the responses obtained on this form. Unfortunately, how-
ever, the poor quality of the responses precluded this. As a result,
the score on this criterion was an overall rating of “Involvement
with Science” on a 3 point scale where 3 was the highest possible
score. For each S, the score was the average of two independent
ratings. These ratings were made by two Ph.D. psychologists.

In order also to provide a score on each of these four variables
with “ability” controlled, a California Mental Maturity Inventory
IQ was obtained for each student from the permanent school
records. Scores with IQ partialled out were then computed for each
S on each criterion, using the procedure and equation presented in
Walker and Lev (1953). The constants for this equation were ob-
tained separately for males and females, and were based on only
the students tested in 1961. (In other words, the 1961 constants
were used in both the 1961 group and the 1962 group.)

⁴ The authors wish to thank Dr. John Holland for granting permission to use
this form.

Thus, in the present study there were eight final criteria. Four of 
these were the basic measures described above and four were the 
same measures, but with IQ held constant by means of partial 
correlation techniques. 
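
A minimal sketch of how a criterion score with IQ partialled out might be computed, assuming the score is the residual from the linear regression of the criterion on IQ. The article cites Walker and Lev (1953) for its equation without reproducing it, so the residual form, the function names, and the toy numbers below are all assumptions. As in the text, the regression constants are estimated on the 1961 group and reused for the 1962 group.

```python
from statistics import mean


def fit_constants(iq, criterion):
    """Least-squares slope and intercept for predicting the criterion from IQ."""
    mean_iq, mean_crit = mean(iq), mean(criterion)
    sxy = sum((x - mean_iq) * (y - mean_crit) for x, y in zip(iq, criterion))
    sxx = sum((x - mean_iq) ** 2 for x in iq)
    slope = sxy / sxx
    return slope, mean_crit - slope * mean_iq


def partial_out(iq, criterion, slope, intercept):
    """Criterion scores with the IQ-predicted part removed (residuals)."""
    return [y - (intercept + slope * x) for x, y in zip(iq, criterion)]


# Toy data, invented for illustration only.
iq_1961, gpa_1961 = [100, 110, 120, 130], [2.0, 2.6, 2.9, 3.7]
iq_1962, gpa_1962 = [105, 125], [2.5, 3.0]

# Constants come from the 1961 group only and are applied to both groups.
slope, intercept = fit_constants(iq_1961, gpa_1961)
resid_1961 = partial_out(iq_1961, gpa_1961, slope, intercept)
resid_1962 = partial_out(iq_1962, gpa_1962, slope, intercept)
print([round(r, 2) for r in resid_1961])
print([round(r, 2) for r in resid_1962])
```

One consequence of this design is that the residuals are exactly uncorrelated with IQ only in the group that supplied the constants; in the other group they are uncorrelated only to the extent that the two regressions agree.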

Analysis of Biographical Information Blank. The techniques used 
in building keys for the biographical information blank were devel- 
oped by Taylor and his associates (1961) and by Ellison and Tay- 
lor (1962) in adult scientist groups, and are similar to the techniques 
used earlier by Siegel (1956). A detailed discussion of the rationale 
for this technique is presented by Ellison (1960). Basically, this 
technique involved computing the biserial correlation between each 
alternative of each item in the biographical information blank and 
each criterion. Thus, for a five alternative item and the eight cri-
teria described above, 40 separate biserial correlations would be
computed.⁵ Since the 300 items of the biographical blank were pre-
dominantly five alternative items, analysis of any one group would
involve computation of approximately 12,000 biserial correlation
coefficients. These coefficients were then used as the basis for build-
ing empirical keys for predicting each criterion from responses to
the biographical information blank.
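
The item analysis can be illustrated with the standard textbook formula for the biserial correlation between a dichotomy (alternative chosen or not) and a continuous criterion. The article does not print the formula it used, so the formula, the use of the population standard deviation, the function name, and the toy data are assumptions.

```python
from statistics import NormalDist, mean, pstdev


def biserial_r(chosen, criterion):
    """Biserial correlation between a dichotomy and a continuous criterion.

    chosen[i] is True if subject i picked the item alternative.
    Textbook formula: r_b = (M1 - M0) / s * (p * q / y), where M1 and M0 are
    the criterion means of choosers and non-choosers, s is the standard
    deviation of all criterion scores, p is the proportion choosing, q = 1 - p,
    and y is the standard normal ordinate at the point dividing p from q.
    """
    ones = [c for flag, c in zip(chosen, criterion) if flag]
    zeros = [c for flag, c in zip(chosen, criterion) if not flag]
    if not ones or not zeros:
        raise ValueError("alternative must be chosen by some but not all subjects")
    p = len(ones) / len(criterion)
    q = 1.0 - p
    std_normal = NormalDist()
    y = std_normal.pdf(std_normal.inv_cdf(p))
    return (mean(ones) - mean(zeros)) / pstdev(criterion) * (p * q / y)


# Toy data: four subjects, two of whom chose the alternative.
r_b = biserial_r([True, False, True, False], [2, 1, 4, 3])
print(round(r_b, 4))

# Number of coefficients for one group, as worked out in the text:
# 300 items x 5 alternatives x 8 criteria.
n_coefficients = 300 * 5 * 8
print(n_coefficients)
```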

In these empirical keys, any item alternative was included which
had a correlation of .20 or higher (in either direction) and which was
chosen by 5 per cent or more of the sample. This arbitrary criterion
was used rather than a system based on a significance test of the
biserial correlation primarily in order to keep the procedures of this
study exactly comparable to Ellison and Taylor's (1962) study of
adult scientists.
Items were keyed with unit weight rather than according to a differ-
ential weighting system based on the size of the biserial correlation
coefficients, since validity studies have shown little or no improve-
ment in prediction from differential weighting of items (Guilford,
1950, pp. 537-542). This means that in these keys, item alternatives
correlating plus .20 or greater were given a weight of +1 and item
alternatives correlating minus .20 or less were given a weight of
-1. Thus an individual's score on a given empirical key was the
number of positively keyed alternatives he chose, minus the number
of negatively keyed alternatives he chose (plus a constant to elimi-
nate negative scores). In the validation and cross-validation of
keys, these individual scores were correlated with the appropriate
criterion scores.

⁵ All computations for this research were made through the courtesy of
Western Data Processing Center, University of California, Los Angeles.
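
The keying and scoring rules just described can be sketched as follows. The .20 and 5 per cent cutoffs and the unit weights come from the text; the function names, the dictionary layout, the alternative labels, and every number in the example are hypothetical.

```python
def build_key(correlations, endorsement, r_min=0.20, p_min=0.05):
    """Return {alternative: +1 or -1} for alternatives meeting both cutoffs.

    correlations: {alternative_id: correlation with the criterion}
    endorsement:  {alternative_id: proportion of the sample choosing it}
    """
    key = {}
    for alt, r in correlations.items():
        if endorsement[alt] < p_min:
            continue  # chosen by fewer than 5 per cent of the sample
        if r >= r_min:
            key[alt] = 1
        elif r <= -r_min:
            key[alt] = -1
    return key


def score(chosen_alternatives, key, constant=0):
    """Positively keyed choices minus negatively keyed choices, plus a constant."""
    return constant + sum(key.get(alt, 0) for alt in chosen_alternatives)


# Hypothetical alternatives, labeled item number plus letter.
correlations = {"12a": 0.31, "12c": -0.24, "40b": 0.22, "57e": 0.19, "63d": -0.40}
endorsement = {"12a": 0.30, "12c": 0.12, "40b": 0.03, "57e": 0.50, "63d": 0.08}

key = build_key(correlations, endorsement)
total = score(["12a", "40b", "57e", "63a"], key, constant=10)
print(key)
print(total)
```

In this example alternative 40b is dropped because too few subjects chose it, and 57e because its correlation falls short of .20, so only 12a contributes to the subject's score.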

A double cross-validation study was carried out for each sex
separately. This means that keys for predicting the criteria built
on the 1961 sample were cross-validated by applying them to the
1962 sample, and keys built on the 1962 sample were cross-validated
on the 1961 sample. In addition to the cross-validities, original
validities were also computed by applying the keys built on a given
group to the same group. The two groups were also combined into
a single large sample, keys were built using the procedure described
above, and original validities were determined.
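
The double cross-validation design can be sketched as below. The key-building rule here is a deliberately simplified stand-in (the sign of the difference in criterion means between choosers and non-choosers) rather than the .20 biserial rule of the article, and all names and data are invented for illustration.

```python
from statistics import mean


def pearson(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def toy_build_key(responses, criterion):
    """Stand-in keying rule: +1 if choosers average higher, else -1."""
    key = []
    for j in range(len(responses[0])):
        chose = [c for r, c in zip(responses, criterion) if r[j]]
        other = [c for r, c in zip(responses, criterion) if not r[j]]
        key.append(1 if mean(chose) > mean(other) else -1)
    return key


def toy_apply_key(key, responses):
    """Unit-weight score for each subject."""
    return [sum(w * x for w, x in zip(key, r)) for r in responses]


def double_cross_validate(samples):
    """Build a key on each sample and correlate its scores with the
    criterion in every sample. Returns {(built_on, applied_to): r}."""
    results = {}
    for built_on, (resp, crit) in samples.items():
        key = toy_build_key(resp, crit)
        for applied_to, (resp2, crit2) in samples.items():
            results[(built_on, applied_to)] = pearson(toy_apply_key(key, resp2), crit2)
    return results


# Invented data: rows are subjects, columns are item alternatives (1 = chosen).
samples = {
    "1961": ([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]], [4, 2, 1, 3]),
    "1962": ([[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]], [3, 4, 2, 1]),
}
results = double_cross_validate(samples)
for (built_on, applied_to), r in sorted(results.items()):
    kind = "original" if built_on == applied_to else "cross"
    print(f"key {built_on} applied to {applied_to} ({kind}): {r:.3f}")
```

In this toy example the original validities are perfect while the cross-validities are lower, which is the shrinkage the design is meant to expose.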


TABLE 1

Male Sample Original Validities and Cross-Validities for Predicting Four Indices of
Achievement in High School Science from the Biographical Information Blank

                                       Original Validities                     Cross-Validities
                                                                         Key for 1961    Key for 1962
                                                                         Group Applied   Group Applied
                                   1961         1962         Combined    to 1962         to 1961
                                   Group        Group        Group       Group           Group

Index of Achievement with IQ Not Controlled
1. Science Grade Point Average     .85          .90          .76         .49**           .52**
2. Science Achievement Test        .81          .88          .69         [illegible]     [illegible]
3. Involvement with Science        .76          .85          [illegible] .29**           .24**
4. Teacher Rating of Overall
   Performance in Science          .79          .87          .74         .52**           .55**

Index of Achievement with IQ Controlled
1. Science GPA                     .80          .88          [illegible] [illegible]     [illegible]
2. Science Achievement Test        .79          .92          [illegible] [illegible]     [illegible]
3. Involvement with Science        .78          .84          [illegible] [illegible]     [illegible]
4. Teacher Rating of Overall
   Performance in Science          .75          .86          .66         .44**           [illegible]

* p < .05
** p < .01
Note: No significance level is reported for original validities because they are to an unknown
degree spuriously high.


TABLE 2

Female Sample Original Validities and Cross-Validities for Predicting Four Indices
of Achievement in High School Science from the Biographical Information Blank

                                       Original Validities                     Cross-Validities
                                                                         Key for 1961    Key for 1962
                                                                         Group Applied   Group Applied
                                   1961         1962         Combined    to 1962         to 1961
                                   Group        Group        Group       Group           Group

Index of Achievement with IQ Not Controlled
1. Science GPA                     .83          [illegible]  .78         .63**           .52**
2. Science Achievement Test        .83          .83          .70         .56**           .50**
3. Involvement with Science        .80          .86          .74         [illegible]     .39*
4. Teacher Rating of Overall
   Performance in Science          .80          .84          .68         [illegible]     [illegible]

Index of Achievement with IQ Controlled
1. Science GPA                     [illegible]  [illegible]  [illegible] .28**           [illegible]
2. Science Achievement Test        .84          .86          .67         .09             .15
3. Involvement with Science        .81          [illegible]  [illegible] .15             .32
4. Teacher Rating of Overall
   Performance in Science          .77          .78          .61         .30**           .29**

* p < .05
** p < .01


Results


The male sample validities and cross-validities for predicting the
four criteria of achievement, both with and without IQ controlled,
are presented in Table 1. The corresponding female sample validities
and cross-validities are presented in Table 2. In previous studies of
biographical blanks, no attempt has been made to construct keys for
criteria with ability controlled. The reason for this is that it has
generally been assumed that biographical blanks account for
considerable criterion variance in addition to what is measured by
ability tests, and that therefore a biographical blank key built
on the raw criterion would, if it held up under cross-validation, also
have substantial validities against the criterion with ability par-
tialled out. The extra computational labor involved in using a
criterion with ability controlled is therefore considered unjustified.
A check on the correctness of this assumption for the present data
was made by cross-validating the keys built on the criteria with
ability not controlled against the criteria with IQ partialled out.
Results for both males and females are presented in Table 3.


TABLE 3

Cross-Validities for Biographical Information Blank Keys Built on Indices of
Achievement with Ability Not Controlled Applied to Indices of
Achievement with IQ Partialled Out

                                   Male Sample Correlations      Female Sample Correlations
                                   Key for       Key for         Key for       Key for
                                   1961 Group    1962 Group      1961 Group    1962 Group
                                   Applied to    Applied to      Applied to    Applied to
                                   1962 Group    1961 Group      1962 Group    1961 Group

1. Science GPA                     .45**         .42**           .22**         [illegible]
2. Science Achievement Test        .11           .18*            .18           .24*
3. Involvement with Science        .11           .26**           [illegible]   [illegible]
4. Teacher Rating of Overall
   Performance in Science          .36**         .47**           [illegible]   .20**

* p < .05
** p < .01


Discussion


The results for the criteria without regard to ability (i.e., with
IQ not controlled) indicate clearly that the Biographical Informa-
tion Blank shows considerable promise as a predictor of achieve-
ment in high school science, although, of course, predictive validity
studies must also be carried out before the full potential of the
Biographical Information Blank can be assessed. All of the original
validities are characteristically (and spuriously) high. The crucial
test, of course, involves the cross-validities, and these values are also
very high for “nonintellectual” predictors, ranging from .24 to .63.
All of these cross-validities are significant at the .05 level, and all
but one are significant at the .01 level of confidence. On the whole,
then, these results are highly similar to those obtained in adult
scientist groups, and therefore indicate that the same instruments
which identify good and poor scientists also identify students who
do and do not achieve in science.

The cross-validities with IQ controlled are somewhat less striking,
and suggest that for some criteria the correlation between the
biographical inventory and achievement disappears when IQ is con-
trolled. As is to be expected, these criteria are those involving mainly
(and perhaps solely) the purely verbal skills measured by the IQ
test, i.e., the Science Achievement Test criterion and the “Involve-
ment with Science” criterion which is based on a short written com-
position. On the other hand, for the more complex criteria (grade
average and teacher rating), the cross-validities are still substantial
even when IQ is controlled. In the authors' opinion, therefore, these
results suggest, overall, that as long as there is true score variance
left in the criterion, the biographical blank will predict a significant
portion of that variance.

The results for applying the keys built on the raw criteria to the 
criteria with IQ controlled are quite interesting and intriguing. 
Cross-validities for these keys against these criteria tend to be 
slightly higher than the cross-validities for the keys actually built 
on the criteria with IQ controlled. The reasons for this are not en- 
tirely clear, but perhaps the most parsimonious explanation would 
be that when IQ is partialled out of the criterion, the relative 
amount of true score variance in that criterion is decreased. This 
makes it more likely that the biographical correlates of such a
criterion are purely random, and therefore less likely that they will 
hold up on cross-validation. However, whatever the reasons for 
these results, the implications are fairly clear. If one is building a 
key in this type of study, very little is to be gained by controlling 
IQ. Therefore, one should simply use the raw criterion scores in 
deriving the key. 

An interesting portrait of high achievers in science can be con-
structed from the biographical correlates of that achievement. First,
they correctly estimated their own high ability. In fact, the best
(or at least most economical) way to find out who your top
scholars are may be to ask them. High achievers also value “money”
less than “people” or “ideas.” Psychological independence would be
a key descriptive phrase in evaluating them. They also appear to be
constructively dissatisfied with themselves, and are constantly
analyzing their own work, in an apparently healthy way.

Having a reading or speaking knowledge of a foreign language is
significantly related to achievement. An interest in chess is also
related. Family associations with people in the academic and scien-
tific world are quite important. Considerable reading, especially of
fiction but also of non-fiction, is also related.

A most interesting characteristic of the top science students is
that they are rather critical of the job their science teachers are
doing and frankly consider it “rather poor” (but not “extremely
poor”). It is uncertain to what degree this judgment reflects objec-
tive reality, but it does reveal an attitude of critical evaluation
which they possess not only toward their teachers but toward
themselves. They also tend to ask a greater than average number
of questions in their classes. Finally, it is an unhealthy sign for high
achievement if a student’s mother was less than 20 when the student
was born.

A final comment should be made about the results of this study in 
view of this portrait. This comment is that the high achiever is 
revealed by the biographical inventory as a student who is familiar 
with the academic world, and who has internalized the value of that 
world. This means that if the present biographical blank were used
in the identification of scientific talent, it could penalize the cultur- 
ally disadvantaged even more than do aptitude tests. However, this 
is not inherent in the biographical approach, since one could easily 
apply a biographical blank in some such cultural backwater as 
Spanish Harlem and determine which items discriminated high 
achievers from low among the culturally impoverished. 


REFERENCES 

Cureton, E. E. “Validity, Reliability, and Baloney.” EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, X (1950), 94-96.

Ellison, R. L. “The Relationship of Certain Biographical Informa-
tion to Success in Science.” Unpublished M. A. thesis, University
of Utah, 1960.

Ellison, R. L. and Taylor, C. W. “The Development and Validation
of a Biographical Inventory for Predicting Success in Science.”
Paper read at American Psychological Association, St. Louis,
1962.

Guilford, J. P. Fundamental Statistics in Psychology and Educa-
tion. (2nd ed.) New York: McGraw-Hill Book Company, 1950.

Siegel, L. “A Biographical Inventory for Students: II. Validation of
the Instrument.” Journal of Applied Psychology, XL (1956),
122-126.

Taylor, C. W., Smith, W. R., Ghiselin, B., and Ellison, R. S. “Ex-
plorations in the Measurement and Prediction of Contributions
of One Sample of Scientists.” USAF, AFSC, Personnel Lab.,
Lackland Air Force Base, ASD-TR-61-96, 1961.

Taylor, C. W., Smith, W. R., and Ghiselin, B. “The Creative and
Other Contributions of One Sample of Scientists.” In C. W.
Taylor and F. Barron (Eds.). Scientific Creativity: Its Recogni-
tion and Development. New York: John Wiley & Sons, 1963.

Walker, Helen and Lev, J. Statistical Inference. New York: Henry
Holt and Company, 1953.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT


PREDICTORS OF SCORES ON AN EMPLOYMENT
COUNSELOR SELECTION BATTERY1, 2


LORRAINE D. EYDE
Division of State Merit Systems
U. S. Department of Health, Education, and Welfare

AND
ROBERT S. WALDROP
University of Maryland


1 The opinions and conclusions expressed herein do not necessarily reflect the
opinions of the Division of State Merit Systems, U. S. Department of Health,
Education, and Welfare.

2 The authors wish to thank Dr. Nancy S. Anderson for her helpful advice
throughout the project and acknowledge Dr. Claude J. Bartlett's assistance in
securing subjects.


Purpose


An employment counselor selection battery was administered to
a group of University of Maryland students in order to study the
battery's relationship to aptitude and background measures. This
pilot validation study was conducted before the battery was in-
cluded in a large scale study of employed counselors.


Predictors


The predictors used include the American Council on Education
Psychological Examination for College Freshmen, 1954 edition
(ACE Quantitative, Linguistic, and Total scores), semester hours
credit in psychology, and cumulative grade average. The ACE
was administered at the time of admission to the University of
Maryland.


Criterion


The total counselor battery score was designated to be the cri-
terion. This examination, made up of 190 four-choice test items,
was developed by the Division of State Merit Systems, U. S. De-
partment of Health, Education, and Welfare, for State Civil Service
use in selecting local office public Employment Service counselors.
The battery has the following subscores.


Number of Items    Content of Subscores

                   GENERAL MENTAL ABILITY + SOCIAL SCIENCE SUBSCORES
      25           Verbal (10 vocabulary, 5 reading comprehension,
                   10 outline completion)
      15           Quantitative (10 computation, 5 chart reading)
      30           Social Science Information (10 government,
                   10 economics, 10 sociology)

                   SPECIALIZED ACHIEVEMENT SUBSCORES3
      59           Psychology
      34           Employment Security Policies and Principles
      27           Placement-Labor Economics


Sample


The sample consisted of 86 students who were enrolled in psy-
chology courses. Of these, 64 were psychology majors and 65 were
upperclassmen (including 32 seniors). Also included were 7 graduate
students, 13 sophomores, and 1 unclassified student. The mean
number of semester hours of psychology completed by the sample was
18.5; the standard deviation was 14.1.


Procedure 


The examination was printed in the form of four booklets. The
general mental ability (verbal and quantitative scores combined)
and social science subtests were each split in half and were printed
in two booklets presented in counter-balanced order to four groups
of students. These booklets always preceded the specialized achieve-
ment subtests. The specialized achievement items, split up into two
booklets, were presented in a similar fashion. Fifty minutes' time was
allowed for the first two booklets dealing with general mental ability
and social science information. The testing time for each of the
remaining two booklets containing specialized achievement items
was 50 minutes.

3 These items were categorized by five psychologists and one Employment
Security counseling specialist. The criterion for inclusion in the first two cate-
gories was agreement by four out of six raters. If the subject matter was psy-
chological the item was included in the first category and if knowledge of the
subject matter was likely to be gained through job experience in Employment
Security the item was assigned to the second category. The remaining items
were placed in the third category which dealt with job placement procedures
and labor economics.


Statistical Analyses*

An RCA 301 computer was used to compute Pearson product-
moment correlations between the predictors, the total criterion score,
and subscores. A multiple correlation between the predictors and the
total criterion score was also calculated.
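
As an illustration of the statistic named above, the following is a minimal sketch of a Pearson product-moment correlation in Python. The function name `pearson_r` and the two score lists are hypothetical and are not taken from the study; the original computation was carried out on an RCA 301 and its program is not described.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a variable with zero variance has no correlation")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for five students (illustrative only, not study data).
ace_l = [52, 60, 47, 71, 65]
battery = [104, 118, 99, 131, 120]
r = pearson_r(ace_l, battery)
```

Because some students lacked ACE scores or missed a testing session, each correlation in a study of this kind would be computed only over the students who have both scores in the pair.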


Results

Table 1 shows that the size of samples on which the mean and
standard deviation of the predictors and the criterion were computed
is variable. ACE scores were not available for transfer students and
graduate students. Only 63 students were present during all three
testing sessions in which the counselor battery was administered.
The general mental ability and social science information scores
tend to be underestimates because these subtests were more speeded
than were the specialized achievement subtests.


TABLE 1

Mean and Standard Deviation for the Predictors and the Criterion for
Product-Moment Correlations and Average Difficulty of
Counselor Battery Items

                                                          Average
Variable                          N     Mean     SD      Difficulty
PREDICTORS
  ACE Q-Score                    ...     ...     ...
  ACE L-Score                    ...     ...     ...
  ACE Total                      ...     ...     ...
  Credits in Psychology           83    18.5     ...
  Grade Average                   82     ...     ...
CRITERION
  Total Counselor Battery         63   115.2     ...        ...
  General Mental Ability          83    24.8     ...        67%
    Verbal                       ...     ...     ...        ...
    Vocabulary                    83     6.3     2.2        69%
    Quantitative                  83     9.3     3.7        74%
  Social Science Information      83     ...     5.8        64%
  Psychology                      63    37.0     ...        58%
  Employment Security             63    18.9     ...        ...
  Placement-Labor Economics      ...    14.8     2.8        66%

Note: The mean and standard deviation scores for the maximum N for each
variable are given. ... P value for each ... Subjects who omitted an item
were not included in the ...

* Three separate item analyses were made on the upper and lower ... of
the sample ... according to the total battery score, appropriate sub-
scores, and grade average.


TABLE 2

Product-Moment Correlations between Predictors and Criterion and
Criterion Subscores

[Table body printed sideways in the original; values not recoverable from the scan.]

Note: The size of the N for the correlations varies. The part-whole
correlation correction formula was applied where appropriate
(Guilford, 1956).
* Significant at .05 level.
** Significant at .01 level.

Table 2 shows the intercorrelations between the criterion sub- 
scores and total criterion score. All of these intercorrelations were 
significant at least at the .05 level. The highest intercorrelations 
were between the total battery score and social science information, 
placement-labor economics, and vocabulary. The total counselor 
battery scores show significant correlations with all other predictors 
except for the ACE Q-score. 
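
A correlation between a total score and one of its own subscores is inflated because the subscore is contained in the total, which is why the note to Table 2 cites a part-whole correction (Guilford, 1956). The sketch below shows the standard formula for the correlation of a part with the remainder of the total; whether the authors applied exactly this form is an assumption, and the function name and input values are hypothetical.

```python
from math import sqrt


def part_remainder_r(r_tp, sd_total, sd_part):
    """Correlation of a part score with (total - part).

    r_tp     : observed correlation between the total and the part
    sd_total : standard deviation of the total score
    sd_part  : standard deviation of the part score
    """
    numerator = r_tp * sd_total - sd_part
    denominator = sqrt(
        sd_total ** 2 + sd_part ** 2 - 2 * r_tp * sd_total * sd_part
    )
    return numerator / denominator


# Hypothetical values (not from Table 2): a subscore correlating .80
# with a total whose SD is three times its own.
corrected = part_remainder_r(0.80, 15.0, 5.0)
```

The corrected value is always lower than the uncorrected part-whole correlation, and the gap widens as the part accounts for more of the total's variance.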

Among the criterion subscores, the social science information, 
placement-labor economics, and psychology scores show the highest 
interrelationships with other counselor battery scores. All three of 
the subscores were significantly related to six out of the seven other 
subscores. All of the predictors correlate significantly with the gen- 
eral mental ability subscore of the counselor battery. 

The ACE L-scores and number of psychology credits were the
best predictors of the counselor battery subscores. Both were related
to six of the eight criterion subscores at a statistically significant
level.

Table 3 shows the total battery and specialized achievement
scores for subjects who have and have not met certain counselor
course requirements. These requirements, developed by the U. S.
Bureau of Employment Security for use by states in hiring employ-
ment counselors, call for completion of at least 15 semester hour
credits in three out of five specified areas of guidance and psy-
chology. Students who met these Employment Security counselor
course requirements scored significantly higher on the total counselor
battery and on the psychology subtests than did those who did not
meet the requirements. No significant differences between these two
groups were found on the Employment Security principles and poli-
cies and the placement-labor economics subtests.


TABLE 3

Mean and Standard Deviation and t Scores (Ratios) for Counselor Battery Related
to Employment Security Counselor Course Requirements

                                    Specialized Achievement Subscores
                  Total                          Employment     Placement-Labor
Met               Battery        Psychology      Security       Economics
Requirements      Mean    SD     Mean    SD      Mean   SD      Mean   SD
Yes (N = 24)      119.8   13.9   39.3    5.2     ...    ...     ...    ...
No (N = 37)       111.7   15.4   35.4    5.8     ...    ...     ...    ...
t                 ...            ...             ...            ...

* Significant at .05 level.


TABLE 4

Mean and Standard Deviation of Predictors and Criterion for
Multiple Correlation, N = 35

[Table values not recoverable from the scan.]
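
The comparison on the total battery can be checked from the summary statistics printed in Table 3. The sketch below assumes an independent-groups t ratio with a pooled variance and treats the printed standard deviations as n - 1 estimates; the article does not say which form of the t ratio was used, and the function name is hypothetical.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-groups t ratio from summary statistics (pooled variance)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Total battery scores from Table 3: met requirements (N = 24)
# versus did not meet requirements (N = 37).
t_total, df = pooled_t(119.8, 13.9, 24, 111.7, 15.4, 37)
```

Under these assumptions the ratio is about 2.08 with 59 degrees of freedom, which exceeds the two-tailed .05 critical value of roughly 2.00 and agrees with the significance reported in the text.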

A multiple correlation coefficient predicting total counselor bat-
tery scores was obtained for 35 students for whom complete data
were available. A comparison of the mean scores and standard devi-
ations for the predictors and criterion for the total sample and the
sample on which the multiple correlation was computed (given in
Tables 1 and 4) in general shows little difference between the
samples. However, the students included in the total sample have
completed more credit hours in psychology than have those in the
smaller sample.


TABLE 5

Product-Moment Correlations and Multiple Correlation Predicting Counselor
Battery Score, N = 35

Variables: 1. ACE Q-Score; 2. ACE L-Score; 3. Credits in Psychology;
4. Grade Average; 5. Total Counselor Battery

[Zero-order and multiple correlation values not recoverable from the scan.]

* Significant at the .05 level.
** Significant at the .01 level.


The multiple correlation, R5.23 = .590, was significant at the
.01 level. Thirty-five per cent of the variance in the counselor bat-
tery score can be accounted for by the variance of the ACE L-scores
and the number of psychology credits. When the multiple correla-
tion based on these two predictors was corrected for shrinkage
(Guilford, 1956), it was still significant at the .01 level.
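
The arithmetic behind the variance statement and the shrinkage correction can be sketched as follows. The adjustment used here is the familiar degrees-of-freedom (Wherry-type) formula; that it matches the exact formula the authors took from Guilford (1956) is an assumption, and the function name is hypothetical.

```python
from math import sqrt


def shrunken_r2(r2, n, k):
    """Squared multiple correlation adjusted for n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


R = 0.590                              # reported multiple correlation
r2 = R ** 2                            # proportion of variance accounted for
r2_adj = shrunken_r2(r2, n=35, k=2)    # two predictors, 35 students
r_adj = sqrt(r2_adj)
```

With these inputs the squared correlation is about .348 (the thirty-five per cent cited above), and the adjusted values are about .307 for the squared correlation and about .554 for the multiple correlation itself.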


Conclusions 


Relatively high verbal aptitude and a knowledge of psychology
appear to be the major factors related to high scores on the 
counselor battery. Knowledge of social science also contributes to 
success on this battery. On the other hand, quantitative ability 
(ACE Q-score and quantitative subscore) and knowledge of Em- 
ployment Security policies and principles contribute less to the 
total criterion score. These findings suggest that the counselor selec-
tion battery identifies the kind of persons employment service
agencies are seeking to recruit, i.e., persons with college training
who have above average verbal ability and a knowledge of psycho- 
logical principles related to counseling and guidance. 


REFERENCE 


Guilford, J. P. Fundamental Statistics in Psychology and Educa-
tion. (3rd ed.) New York: McGraw-Hill Book Company, 1956.




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 4, 1963


VALIDATION STUDIES OF A READING PROGNOSIS TEST 
FOR CHILDREN OF LOWER AND MIDDLE 
SOCIO-ECONOMIC STATUS1


MAX WEINER2
Brooklyn College 


AND 
SHIRLEY FELDMANN 
New York Medical College 


Purpose 


These studies were undertaken to determine whether a Reading 
Prognosis Test could be constructed to measure future reading 
ability based on present skills and knowledge of children from 
different Socio-Economic Status levels (SES). The literature re-
ports a number of valid reading readiness tests. One major short-
coming of these tests, however, is the type of norms used; that is,
too few children representing Lower SES are included. In previous
work with children from Lower SES levels it became apparent that
standardized tests did not yield appropriate interpretable scores
for children at the Lower SES levels (Feldmann and Weiner, in 
press). 

In addition, no clear rationale of underlying skills is presented 
in most reading readiness tests. In general, they attempt to measure 
global skills and yield a predictive score on that basis, but they do 
not provide a differentiation of the child's present skills. Therefore, 
the present Reading Prognosis Test was constructed in an attempt 
to correct for both of those deficiencies. 

1 The research reported herein was partially supported by ... New
York State Mental Health Authority, Research Grant ..., and par-
tially supported by ... Grant No. MH 820, from ... Department of
Health, Education, and Welfare, National Institutes of Health.

2 Consultant: Institute for Developmental Studies, New York Medical
College.


Method 


In order to facilitate the study of children from different SES 
groups, a balanced sample was drawn, including equal numbers of
children from Middle and Lower SES, and equal numbers of Negro 
and white children. 

SES was determined by use of the Institute for Developmental
Studies’ Socio-Economic Scale, which takes into account occupation
of adult members of the family, number of rooms in home per
occupant, and educational level of adult family members.3

The test itself was constructed not only to yield a global readi 
predictive score, but also to provide scores in three areas which 
would identify the present status of reading skills and of skills
underlying reading.4 The three areas measured are: Language, Per-
ceptual Discrimination, and Beginning Reading Skills. In two of
the areas, Perceptual Discrimination and Language, the items were
designed to measure levels of skills which underlie both beginning
and advanced reading; these skills are considered basic to general
reading achievement.

The third area, Beginning Reading Skills, measures the present
status of the child in reading. Since skills measured in this area
require direct use of learned facts, they may be indicative
of beginning reading ability. However, if such facts were memorized
without understanding of the basic skills underlying them, their
transfer to more advanced reading skills might be inhibited.

The three major areas are divided into eight subtests. A Word 
Meaning test and a Storytelling test make up the subtests in the 
Language area. The Word Meaning subtest measures expressive 
language of the child by asking him to define words. The Story- 
telling test is a measure of the child's skill in constructing and tell- 
ing a story from a series of related pictures. 

In the Perceptual Discrimination area the three subtests are
Visual Similarities, Visual Discrimination, and Auditory Discrimi-
nation. The Visual Similarities and Visual Discrimination tests re-
quire matching of three and four letter words. The Auditory Dis-
crimination test requires the child to judge whether spoken pairs of
words are the same or different.

3 See Institute for Developmental Studies Memo # 11 for a description of
the Socio-Economic Scale.

4 The original items of the test were constructed by Shirley Feldmann, Ida
..., and Virginia Graff, Institute for Developmental Studies Memo ...

The subtests in the Beginning Reading Skills area are Small 
Alphabet Letters, Capital Alphabet Letters, and Sight Vocabulary. 
The Alphabet Letter subtests require identification of the letters by
the child. The Sight Vocabulary subtest includes primer words and 
words commonly found in the child's environment, all of which must 
be identified by the child. 

Since the Reading Prognosis Test was designed primarily for use 
with young children at the end of the kindergarten or at the begin- 
ning of the first grade year and since its results are to be used for 
diagnostic purposes, it was decided at the outset that the test would 
be an individually administered one. 


First Validation Study 


The initial pilot study of the test with a group of 40 children was 
encouraging enough to motivate a relatively definitive study of the 
test. Even with the limitations of a small sample and with adminis-
tration of the test rather late in the school year, the Reading Prog-
nosis Test correlated .87 with the Gates Primary Word Reading
Test of 1958. Certain improvements were made in the format as a
result of the pilot study.


Second Validation Study 


For the second study a larger N in a balanced sample was selected 
from six schools in New York City. Bases for selection were four- 
fold: socio-economic status, race, sex, and grade. In all, 138 pupils 
were selected. Table 1 summarizes the distribution of the sample 
by SES, race, and sex.

The Reading Prognosis Test was administered in October of the
school year within a two week period. In June of the school year,
two reading achievement tests were administered to the children of
the original group who were present. The tests used as the validity
criteria were the Gates Primary Reading Tests: Sentence Reading
and Paragraph Reading (1958). A total of 126 children, who were
present for both the October and June testing, comprise the sample
for the present study.

5 Each child was given the test again three ... for purposes of
determining test-retest reliability. The reliability was ... Total group.




TABLE 1

Distribution of Children by Socio-Economic Status (SES), Sex, and Race
(N = 138)

              Middle SES         Lower SES          Total

Negro         Male      15       Male      19
              Female    15       Female    20
                        30                 39          69

White         Male      15       Male      18
              Female    15       Female    21
                        30                 39          69

Total                   60                 78         138


The means and standard deviations for the major groupings, by
SES, race, and sex, were calculated for the Reading Prognosis Total
test score and the subtest scores. The total test mean for the sample
of 126 children is 59.07, with a SD of 21.65. The means range from a
high of 77.40 with a SD of 18.08 for the Middle SES Negro group,
to a mean of 37.47, with a SD of 17.10 for the Lower SES Negro
male group. The two major groups by SES, Lower Class and Middle
Class, earned significantly different scores on all subtests as well as
on total scores.6 However, there was adequate variability among
Lower SES groups to provide discrimination data at the lower end
of the scale.

Tables 2 and 3 show the intercorrelations among the subtests and 
also correlations of the subtests with the Reading Prognosis total 
test score according to Middle SES (MC) and Lower SES (LC)
groups. 

Inspection of these tables reveals about the same degree of rela- 
tionship for correlation of subtests for both Middle and Lower SES 
groups. The correlations are relatively high, except for subtests 7 
and 8. Greater differences in correlations are found between Middle 
SES and Lower SES groups for subtests 4 and 5, Meaning Vocabu- 
lary and Visual Discrimination. 


A stepwise multiple linear regression was undertaken in order to
determine the degree to which each of the three major areas of the
Reading Prognosis Test contributed to the prediction of the Reading
Prognosis total test score. The three areas, Language, Perceptual
Discrimination, and Beginning Reading Skills contribute significantly
to the prediction of the Reading Prognosis total test scores. The
analysis included grouping of the data according to the total group,
Middle SES (MC), and Lower SES (LC) groups.

6 p < .01: Total score; also Small Letters, Capital Letters, Meaning Vocabu-
lary, and Primer Vocabulary subtests.
p < .05: Visual Discrimination, Visual Similarities, Auditory Discrimina-
tion, and Storytelling subtests.


TABLE 2

Intercorrelations of Reading Prognosis Total Test Score and Subtest Scores,
for MC Group (N = 54)

                                            Subtests                     Total
Subtests                       2     3     4     5     6     7     8     Score
1. Capital Letters           .896  .416  .492  .452  .602  .137  .184    .847
2. Small Letters                   .942  .352  .360  .489  .097  .152    .727
3. Sight Vocabulary                      .264  .363  .319 -.125  .108    .556
4. Meaning Vocabulary                          .946  .489  .036  .081    .684
5. Visual Discrimination                             .728  .165  .160    .701
6. Visual Similarities                                      ...  .205    .822
7. Auditory Discrimination                                       .070    .217
8. Storytelling                                                          .370


TABLE 3

Intercorrelations of Reading Prognosis Total Test Score and Subtest Scores,
for LC Group (N = 72)

                                            Subtests                     Total
Subtests                       2     3     4     5     6     7     8     Score
1. Capital Letters           .915  .435  .668  .575  .659 -.020   ...     ...
2. Small Letters                   .318  .588  .517  .572 -.004  .197     ...
3. Sight Vocabulary                      .377  .528  .432  .000  .300    .635
4. Meaning Vocabulary                          .498  .500 -.081  .102    .664
5. Visual Discrimination                             .688  .147  .245    .799
6. Visual Similarities                                     .040  .092    .798
7. Auditory Discrimination                                      -.049    .242
8. Storytelling                                                           ...


Beginning Reading Skills subtests (Capital Letters, Small Letters,
and Sight Vocabulary) yielded R²'s of .854, .774, and .812 for the
Total group (N = 126), LC group (N = 72), and MC group (N = 54),
respectively. Language subtests (Meaning Vocabulary and Story-
telling) resulted in R²'s of .557, .423, and .407 for the Total, LC,
and MC groups, respectively.

The Perceptual Discrimination subtests (Auditory Discrimina-
tion, Visual Discrimination, and Visual Similarities), which were
also significant contributors toward the prediction of the Reading
Prognosis total test score, yielded R²'s of .755, .781, and .712 for the
Total, LC, and MC groups, respectively.
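
For an area made up of two subtests, such as Language, a squared multiple correlation can be obtained directly from three zero-order correlations. The sketch below uses the standard two-predictor formula; it is not the stepwise program the authors ran, the function name is hypothetical, and the input correlations are invented for illustration rather than taken from Tables 2 and 3.

```python
def r2_two_predictors(r_y1, r_y2, r_12):
    """Squared multiple correlation of a criterion on two predictors.

    r_y1, r_y2 : correlations of each predictor with the criterion
    r_12       : correlation between the two predictors
    """
    if abs(r_12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Hypothetical correlations (not values from the study).
example_r2 = r2_two_predictors(0.60, 0.50, 0.30)
```

When the two predictors are uncorrelated the result reduces to the sum of the two squared validities; overlap between the predictors lowers the joint contribution.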

The validity criteria, the Gates Sentence Reading and Paragraph
Reading tests, were administered in June of the school year. Table
4 shows the intercorrelations of the reading grade scores with the
Reading Prognosis total test score, as well as the means and
standard deviations of the reading grade scores, by major groupings.

The correlations of the Reading Prognosis total test score with
the Paragraph Reading test range from .719 for the LC
Negro female group to .892 for the MC white male group. The
Total group r is .810. The correlations of the Reading Prognosis
total test score with the Sentence Reading test range from
.614 for the MC Negro female group to .884 for the MC white fe-
male group. There is little doubt that the Reading Prognosis total
test score, when administered at the beginning of grade 1, is a good
predictor of Gates Paragraph Reading and Sentence Reading scores.
The range of mean grade scores on both tests is from 1.87 for the LC
Negro male group to 3.05 for the MC white female group.

Tables 5 and 6 show the intercorrelations of the subtests of the
Reading Prognosis test with the Gates Paragraph Reading and
Sentence Reading grade scores for LC, MC, and Total groups.


TABLE 4

Intercorrelations between Reading Prognosis Total Test Score and Gates Primary
Paragraph Reading (PPR) and Primary Sentence Reading (PSR)
Grade Scores by Race, Sex, and SES

                       Reading Prognosis
                       Test, with:             PPR Grade       PSR Grade
                       PPR Grade  PSR Grade    Score           Score
Groups            N    Score      Score        Mean    SD      Mean    SD
Total Group      126   .810       .780         2.40    ...     ...     ...
LC                72   .723       .680         2.08    .59     ...     ...
MC                54   .772       .738         2.82    .77     ...     ...
Negro             60   .828       .780         2.28    .66     ...     ...
White             66   .822       .792         2.51    .83     ...     ...
LCN Male          17   .807       .700         1.87    .43     ...     ...
LCN Female        19   .719       .710         2.09    .48     ...     ...
LCW Male          20   .781       .704         2.03    .66     ...     ...
LCW Female        16   .736       .710         2.35    .65     ...     ...
MCN Male          14   .748       .761         2.02    .72     ...     ...
MCN Female        10   .751       .614         2.84    .61     ...     ...
MCW Male          15   .892       .759         2.76    .70     ...     ...
MCW Female       ...   ...        .884         3.05    .92     ...     ...


TABLE 5

Correlations of Reading Prognosis Test Subtests with Gates Primary Paragraph
Reading Test, by LC, MC, and Total Groups

Gates Primary Paragraph         LC          MC          Total Group
Reading with:                   (N = 72)    (N = 54)    (N = 126)
Total Score                     .728        .772        .810
1. Capital Letters              .620        .624        .716
2. Small Letters                .654        .560        .700
3. Sight Vocabulary             .407        .384        .534
4. Meaning Vocabulary           .625        .660        .668
5. Visual Discrimination        .612        .509        .609
6. Visual Similarities          .617        .683        .687
7. Auditory Discrimination      .033        .145        .188
8. Storytelling                 .215        .146        .290

Test Revision

Several items on each of the subtests were revised as a result of
item analysis interpretations. Subtests 7 and 8 were completely
changed because of their rather low validity and reliability.


TABLE 6

Correlations of Reading Prognosis Test Subtests with Gates Primary Sentence
Reading Test, by LC, MC, and Total Groups

Gates Primary Sentence          LC          MC          Total Group
Reading with:                   (N = 72)    (N = 54)    (N = 126)
Total Score                     .680        .738        .745

[Rows for the eight subtests are misaligned in the scan; values not reliably recoverable.]


Conclusions 


Based on the finding of the present study, it is believed that poor 
readers from any Socio-Economic Status group can be identified 
before formal training in reading takes place and that their skill 
deficiencies underlying reading can be ascertained. 

Again, the results of the second study were encouraging enough 
to initiate a third validation study, using the revised version of the 
test; this study is now under way. 


REFERENCES 


Feldmann, S. and Weiner, M. “The Use of a Standardized Reading
Achievement Test with Two Levels of Socio-Economic-Status
Pupils.” Journal of Educational Research, in press.

Gates Primary Reading Tests: Word Recognition, Sentence Read- 
ing and Paragraph Reading. New York: Bureau of Publications, 
Teachers College, Columbia University, 1958. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
Vol. XXIII, No. 4, 1963


RELATIONSHIP OF HIGH SCHOOL CURRICULUM 
EXPERIENCES TO COLLEGE GRADE POINT AVERAGE 


JOSEPH PAUL GIUSTI 
The Pennsylvania State University 


Problem 


This article reports the relationship of high school curriculum
experiences to college grade point average in the College of Educa-
tion of The Pennsylvania State University.


Subjects 


The high school records which were used were those of the 397
men and women students who were admitted to the College of Edu-
cation, on a regular basis, in September of 1960, and who were
graduated from an accredited high school not more than three
months before that date.


Predictor Variables 


The high school records of these students were studied with re-
spect to six variables: High School Index (HSI) and five High
School Subject Fields (HSSF).1 The grade point averages of each
of these variables were computed for each case from the high school
admissions record form prepared by the Admissions Office of the
University, which standardizes all high school grades and interprets
them according to the grade point system employed at the Uni-
versity.

1 HSSF is a grouping of related high school subjects into five fields: English,
Mathematics, History, Science, and Foreign Language; HSI is the general
average in high school, including all subjects.


Criterion Measure 


The CGPA (college grade point average) for the freshman year 
was selected as the criterion. Each student takes two courses in Eng- 
lish, Natural Science, and Physical Education, and one each in 
Philosophy, Psychology, and Sociology. The remaining courses, 
which are three in number, allow the student to engage in work 
appropriate to his field of specialization for the purpose of confirm-
ing or modifying his choice of curriculum. The CGPA, comprised of
subjects common to all students, was computed for each case from 
the University transcript. 


Statistical Treatment 


Through use of the IBM 7074 at The Pennsylvania State Univer- 
sity Computation Center, product-moment correlations were cal- 
culated to ascertain the strength of linear relationships among the 
various pairs of variables. 


Findings 

The findings of the study are shown in Table 1 which gives the 
matrix of product-moment correlations, standard deviations, and 
means for the seven variables. The first column of the matrix con- 
tains the correlations of each of the predictor variables with the 
criterion. The remainder of the matrix consists of correlations among 
the predictor variables. 

Comparison with similar data indicates that the findings of the
study are fairly average with respect to the predictor variables and
criterion measure.


TABLE 1

Intercorrelations, Means, and Standard Deviations of Predictor and
Criterion Variables (N = 397)

Variables               CGPA    X1      X2      X3      X4      X5      X6
X1 HSI                  ...
X2 English              .401    .777
X3 Mathematics          .338    .686    .410
X4 History              .942    .708    .524    .358
X5 Science              .274    .485    .352    .245    .348
X6 Language             .334    .683    .537    .409    .450    .817
Mean                    2.50    3.00    3.20    2.8     3.18    2.84    2.77
Standard Deviation      .51     .45     .51     .67     .55     .81     .83

r > .113 significant at 5 per cent level of confidence.
r > .148 significant at 1 per cent level of confidence.

EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
Vol. XXIII, No. 4, 1963


INTELLECTIVE AND NON-INTELLECTIVE PREDICTORS 
OF SUCCESS IN NURSING TRAINING 


WILLIAM B. MICHAEL 
University of California, Santa Barbara 
RUSSELL HANEY 
Los Angeles, California 


AND 
ARTHUR GERSHON 
University of Southern California 


Problem 


The major objective of this investigation was to obtain, for a
sample of 117 freshman trainees in student nursing for the
academic year of 1962-1963, cross-validation data for several of the
cognitive and non-cognitive predictors that had been employed in
a recent study for the 1961-1962 freshman class reported by Haney,
Michael, and Gershon (1962) as well as in prior investigations cited
in the bibliography of the same recent study. Several of the pre-
dictor variables of interest which were described in the recent study
mentioned are listed in Table 1 (numbers 1 through 16) along with
eight criterion variables which constitute grades in key courses
(numbers 17-24) and along with two sets of ratings in five items
on the Ward Performance Scale. Only those MMPI scales are
reported for which validation findings in previous studies revealed
some degree of promise. The five items in the Ward Performance
Scale pertained to qualities of planning work, ability to communi-
cate, maturity, soundness of judgment, and task completion.

For a second small sample of 27 nursing trainees at San Fernando
Valley City College a parallel study was carried out involving
many of the same predictor variables and criterion variables as well


TABLE 1

Predictive Validity Coefficients of Measures for 1962-1963 Nursing Class of the Los Angeles County Hospital (N = 117)

Predictors (Variables 1-16) and Criterion Measures (Variables 17-26): Validity Coefficients of Predictors Along with Intercorrelations of Criterion Measures

Predictors: 1. California Reading Test (Total); 2. California Reading Test (Vocabulary); 3. California Reading Test (Comprehension); 4. California Mathematics Test (Total); 5. California Mathematics Test (Reasoning); 6. California Mathematics Test (Fundamentals); 7. EAST Visual Pursuit; 8. EAST Visual Speed and Accuracy; 9. EAST Space Visualization; 10. MMPI L Scale; 11. MMPI Hs + .5K Scale; 12. MMPI D Scale; 13. MMPI Hy Scale; 14. MMPI Pd + .4K Scale; 15. High School Grade Point Average; 16. High School Chemistry (Two Semester GPA).

Criterion measures: 17. Anatomy Grades; 18. Physiology Grades; 19. Microbiology Grades; 20. Psychology I Grades; 21. Psychology II Grades; 22. Nutrition Grades; 23. Nursing I Grades; 24. Nursing II Grades; 25. Ward Performance Scale (February); 26. Ward Performance Scale (June).

[Coefficient entries are not legible in this copy.]




as certain additional cognitive and non-cognitive predictors. Since 
the results were highly similar to those found for the larger group, 
mention will be made only of findings relative to the scales of the 
Edwards Personal Preference Schedule (EPPS). 


Statistical Treatment 


Through use of the IBM 7090 correlation program at the Western
Data Processing Center of U.C.L.A., product-moment coefficients of
correlation were calculated among all possible pairings of variables
described in Table 1. (A parallel analysis was carried out for the
second sample of 27 trainees.) For standing on the items of the
Ward Performance Scale the same normalized-rank method of
conversion of scores to stanine scores as described by Haney,
Michael, and Gershon (1962) was employed to adjust for differences
in sizes of classes (usually 12 to 20 students) that were being super-
vised in Ward activities. It should also be noted that the magnitudes
of the correlation coefficients are minimal in view of a marked
restriction of range in scores upon the first six variables, with
respect to which students had to place at or above approximately the
50th centile relative to published twelfth grade norms. No correc-
tions for restriction of range are reported.
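
The normalized-rank conversion mentioned above can be illustrated with a minimal sketch. The exact procedure of Haney, Michael, and Gershon (1962) is not reproduced in this report, so the following is only one common way of doing it, under stated assumptions: ratings are ranked within each supervised class (ties receive average ranks), each rank is turned into a mid-rank proportion, and the proportion is mapped to a stanine through the conventional cumulative cut points of 4, 11, 23, 40, 60, 77, 89, and 96 per cent. The function names are hypothetical.

```python
from bisect import bisect_right

# Conventional cumulative proportions separating stanines 1..9 (assumption:
# the usual 4-7-12-17-20-17-12-7-4 per cent bands).
STANINE_CUTS = [0.04, 0.11, 0.23, 0.40, 0.60, 0.77, 0.89, 0.96]


def average_ranks(scores):
    """Return 1-based ranks, giving tied scores the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def stanines_within_class(scores):
    """Convert raw ratings from ONE class to stanines via normalized ranks."""
    n = len(scores)
    proportions = [(rank - 0.5) / n for rank in average_ranks(scores)]
    return [bisect_right(STANINE_CUTS, p) + 1 for p in proportions]


# Hypothetical ratings for a small supervised class of three trainees.
print(stanines_within_class([30, 10, 20]))
```

Because the conversion is carried out separately within each class, a trainee's stanine reflects relative standing among classmates, which is what makes ratings from classes of different sizes comparable.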


Results 


The following principal findings may be summarized for the
entries of Table 1.

1. In comparison with classes of previous years the predictive
validities of the California Tests and of the Employment Aptitude
Survey Tests (EAST) are low. However, as mentioned previously,
the restriction of range was substantial. Although not reported, the
standard deviations of the scores on various cognitive predictors
were frequently one-half to two-thirds as large as those of the
scores for the unselected, or total, group of applicants.

2. As inspection of the validity coefficients will indicate, grade
point averages in the high school program as well as in two semesters
of chemistry were significantly related to success in Physiology,
Microbiology, and Nursing II; whereas for the most part the intel-
lective tests were not correlated significantly with the criterion
variables. On the other hand, total scores on the California Reading
Test and on the Vocabulary part were significantly correlated with
achievement in Anatomy, Psychology I, and Psychology II; whereas
for these same criterion variables high school attainment was not
significantly related.

3. As is apparent from the correlational data, none of the intellective
predictors (level of achievement in high school or standing on apti-
tude and achievement tests) was significantly predictive of success
on the composite of five measured qualities of Ward Performance.
However, it is urged that further work be done to improve the
coverage and reliability of this criterion. A test-retest estimate of
reliability of .77 was found for the Ward Performance Scale.

4. With respect to the MMPI, there is, as there was with the 
previous freshman class, a pattern of negative validity coefficients 
of the scales reported relative to both the academic criterion meas- 
ures and the two Ward Performance measures. Although many of 
the coefficients do not reach significance at the .05 level, there is 
the suggestion that those trainees who stand lower on the MMPI 
scales (and thus reflect what might be considered a more favorable 
degree of adjustment) tend to be slightly more successful in the 
program than the trainees who place relatively high on these charac- 
teristics. In particular one might infer that the more successful stu- 
dents as compared with the less successful ones tend toward less 
falsification of responses and toward manifesting behavior that does 
not represent a syndrome of depression, hysteria, hypochondriasis, 
or psychopathic deviation. 

5. For the most part grades in courses showed moderate to high
intercorrelations. Apparently there are sources of common-factor
variance underlying achievement in course work. Hypothesized
sources of systematic variance would include: (1) halo effect, (2) a
motivational syndrome, (3) academic savoir-faire (perception of
and empathy with a teacher’s or supervisor’s needs and expecta-
tions), and (4) institutional press manifested by conformity to
imposed standards of conduct, attendance, punctuality, and by prac-
tice of sanctioned and rewarded patterns of work habits.

6. For the small sample of 27 candidates the Succorance, Con-
formity, and Order Scales of the EPPS consistently yielded, respec-
tively, negative, positive, and positive correlations with the 15
items of a Nursing Evaluation scale and grades in nursing courses.
Since the majority of these correlations were statistically reliable,
further explorations with these scales are planned.
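
Finding 1 notes that the standard deviations of the selected group were often one-half to two-thirds as large as those of all applicants, and no correction for restriction of range was applied. As a rough illustration of how much this could matter, the sketch below uses the standard correction for direct selection on the predictor (often cited as Thorndike's Case 2). This is not a computation from the study: the observed coefficient of .20 is a hypothetical value chosen only for illustration, and the correction assumes linear regression and equal error variance across the range.

```python
import math


def correct_for_range_restriction(r_restricted, sd_ratio):
    """Estimate the unrestricted correlation under direct selection.

    r_restricted: correlation observed in the selected group.
    sd_ratio: predictor SD in the unselected group divided by the
              predictor SD in the selected group (1.0 means no restriction).
    """
    k = sd_ratio
    r = r_restricted
    return (r * k) / math.sqrt(1.0 - r * r + r * r * k * k)


# Hypothetical observed validity; the SD ratios correspond to selected-group
# SDs that are two-thirds and one-half as large as the applicant-group SDs.
observed_r = 0.20
for ratio in (1.5, 2.0):
    corrected = correct_for_range_restriction(observed_r, ratio)
    print(f"SD ratio {ratio:.1f}: corrected r = {corrected:.2f}")
```

Under these assumptions an observed coefficient of .20 would correspond to roughly .29 to .38 in the unselected group, which is consistent with the caution that the reported coefficients are conservative.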




Conclusion 


In general, aptitude and achievement test scores and grades in 
high school were only modestly predictive of success in the academic 
portions of the nursing training program; whereas four scales of the 
MMPI were negatively related to a criterion measure of Ward Per- 
formance. For improvement in prediction of the latter criterion it is 
suggested that a content analysis of autobiographies of trainees and 
of statements of supervisory nurses as to what constitutes the image 
of a "good nurse" might afford a basis for generation of categories
of behavior about which biographical and interest items could be 
constructed, then could be validated empirically, and subsequently 
could be assigned appropriate scoring weights. 


REFERENCE 


Haney, R., Michael, W. B., and Gershon, A. “Achievement, Apti- 
tude, and Personality Measures as Predictors of Success in 
Nursing Training.” EDUCATIONAL AND PSYCHOLOGICAL MEASURE- 
MENT, XXII (1962), 389-392. 




A STUDY OF THE VALIDITY OF THE PROGRAMMER
APTITUDE TEST


THOMAS C. OLIVER AND WARREN K. WILLIS
Southern Illinois University 


Background 


As the number of institutions of higher education with computer
installations increases, educators are faced with the relatively new
task of training computer programmers. It is well known that the
demand for proficient programmers has far exceeded the supply in
the past few years. It is apparent that methods of selecting poten-
tially competent programmers are in great demand.

The Programmer Aptitude Test (PAT) is a paper and pencil test
designed to measure the kinds of reasoning abilities judged to be
important in computer programming. The test is divided into three
sections, namely, Number Series, Figure Analogies, and Arithmetic
Reasoning. A total score consisting of the sum of the three sub-tests
is obtainable. The reliability of the PAT is reported by the authors
of the test. The corrected Spearman-Brown reliability coefficient is
.90 for the total test.


Problem


The purpose of this study was to determine the validity of the
PAT in predicting final grades in a computer programming course
at Southern Illinois University. The course is designed to acquaint
the student with digital computer programming. Topics include
computer organization and characteristics, machine language cod-
ing, flow charts, sub-routines, symbolic coding, and compiler sys-
tems. The facilities of Southern Illinois University's Data Processing
and Computing Center are used for extensive laboratory applica-
tion. Final grades in the course are based upon demonstration of
proficiency in integration and application of computer programming
principles.


Sample 


The sample consisted of 27 subjects completing the computer
programming course at Southern Illinois University during the
Summer Quarter of 1962. Subjects who registered for the course
but did not complete it (N = 14) were not included
in the study.


Procedures 


The PAT was administered to class members on the opening day 
of the course. The results of the examination were withheld from the 
instructor until final course grades had been submitted. The grading
system is on a five-point scale (A = 5, B = 4, C = 3, D = 2, E = 1).


Results and Discussion 


Means and standard deviations of the PAT, and zero-order corre- 
lation coefficients between the four PAT scores and course grades 
are reported in Table 1. In addition, intercorrelations were com- 
puted between all sub-tests and the total score on the PAT. The 
results of this analysis are reported in Table 2. 

The correlation coefficients reported in Table 1 indicate relatively 
high validities for the PAT. All validity coefficients were statistically 
significant at beyond the .001 level. It should be noted that the total 
PAT score is the best single predictor of course grades. It is evident 
that the PAT has, in this study, demonstrated exceptionally high 
utility for predicting grades in a computer programming course. 


TABLE 1


Means, Standard Deviations, and Zero-Order Correlations between the
PAT and Course Grades*


                                     Standard       Validity
Test                      Means     Deviations    Coefficients
Number Series             15.68        3.03           .65
Figure Analogies          26.95        5.78           .70
Arithmetic Reasoning      13.05        3.76           .40
Total                     52.22       11.16           .78




TABLE 2

Intercorrelation Matrix of the PAT Test


                           1      2      3      4
Number Series             --     .66    .51    .86
Figure Analogies                 --     .35    .83
Arithmetic Reasoning                    --     .73
Total                                          --


However, predictive validity coefficients should be determined for 
additional samples in order to ascertain whether the promising 
validity will hold up under cross-validation. 

Table 2 contains an intercorrelation matrix of the four measures 
of the PAT. Correlations ranged from .35 to .73 and all are sig- 
nificant at beyond the .001 level. 
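
The significance statements above can be checked with the usual t test for a zero-order correlation, t = r * sqrt(N - 2) / sqrt(1 - r^2), with N - 2 degrees of freedom. The sketch below is an illustration rather than the authors' computation; it takes the coefficients as printed in Table 1 and N = 27 at face value, and the critical value of 3.725 (two-tailed, .001 level, 25 degrees of freedom) is quoted from a standard t table as an assumption.

```python
import math


def t_for_r(r, n):
    """t statistic for testing a Pearson r against zero, df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


N = 27
# Two-tailed critical t at the .001 level for df = 25, taken from a
# standard t table (assumption; not stated in the article).
CRITICAL_T_001 = 3.725

# Validity coefficients as printed in Table 1.
validities = {
    "Number Series": 0.65,
    "Figure Analogies": 0.70,
    "Arithmetic Reasoning": 0.40,
    "Total": 0.78,
}

results = {name: t_for_r(r, N) for name, r in validities.items()}
for name, t in results.items():
    verdict = "exceeds" if t > CRITICAL_T_001 else "does not exceed"
    print(f"{name}: t = {t:.2f}, {verdict} the .001 critical value")
```

On the values as printed, Number Series, Figure Analogies, and the Total score give t values of roughly 4.3, 4.9, and 6.2, while Arithmetic Reasoning (.40) gives about 2.18. The printed coefficients or the stated significance level may therefore be worth checking against the original tables.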


Summary 


The study was concerned with establishing the predictive validity 
of the PAT in relation to computer programming course grades. The 
test was administered to 27 students at Southern Illinois University 
on the first day of class during the Summer Quarter of 1962. 

Although analysis of the data indicated the PAT to be a valid 
predictor of grades in the course (p < .001), cross-validation studies 
with large samples of students in computer programming courses are 
needed to demonstrate the utility of the test. Such studies are cur-
rently underway.


THE COMPARATIVE EFFECTIVENESS OF INTELLECTIVE 
AND NON-INTELLECTIVE MEASURES IN THE 
PREDICTION OF THE COMPLETION OF A MAJOR 
IN THEATER ARTS 


JACK MORRISON 
University of California at Los Angeles 


Among others, Anastasi, Meade, and Schneiders (1960) and Getzels
and Jackson (1962) have indicated that tests of intellectual ability
do not effectively reflect potential “success” among students of a
fairly high intellectual level. They declared, however, that bio-
graphical and autobiographical materials do hold promise as effec-
tive, discriminating variables. Evidence from a study of 400 Theater
Arts Majors at UCLA included Quantitative and Linguistic scores
from the American Council on Education (ACE) Psychological
Examination and statements of behavior from a semi-structured au-
tobiography. The men and the women were each divided into two
groups: 100 who completed their Bachelor of Arts degrees (designated
as BA's) and 100 who dropped out of the Theater Arts Major
(referred to as DO's). Frequencies, means, and chi squares were
computed through use of the Questionnaire Analysis Program I for
the IBM-709 (Model 4) at the Western Data Processing Center.

Findings 

Among the men, the BA's mean score on the Quantitative scale of
the ACE Psychological Examination was lower than the DO's mean
score at a significance level of .20. On the Linguistic scale of the
ACE Psychological Examination the mean of the male BA's was
also lower than that of the DO's at a level of significance approxi-
mating .30. (See Table 1.)


TABLE 1


Mean Scores for Theater Arts Men and for Theater Arts Women on the
ACE Psychological Examination


                  N     Quantitative    Linguistic
Male BA's        100       44.68          73.72
Male DO's        100       45.95          74.81
Female BA's      100       43.40          74.40
Female DO's      100       44.19          73.38


Among the women, the mean of the BA's was lower than that for 
the DO's at a significance level of .30 on the Quantitative scale of 
the ACE Psychological Examination. On the Linguistic scale of the 
ACE Psychological Examination the mean for the BA's was slightly 
higher than that of the DO's, but not significantly (P > .60). (See 
Table 1.) 

From the men’s autobiographical material the BA’s statements of 
“conviction” to be a Theater Arts Major appeared more often than 
similar statements for the male DO’s—a difference significant at 
the .01 level. (See Table 2.) Significant at the .10 level was the 
difference between mean frequency of reported incidents of partici- 
pation in shows (acting, directing) on the part of DO’s and BA’s, 
the means for whom were 4.51 and 5.49, respectively. 

The women produced four variables relative to which significant
differences occurred between the BA’s and DO’s. The data for three
of these variables are described in Table 3. The female DO’s
named specific occupational goals in their autobiographies
more often than did the BA’s (significant at the .05 level). The
female DO’s also expressed a higher frequency of statements of
indecision to be a Theater Arts Major than did the BA’s (significant
at the .01 level). The BA’s, however, reported “conviction” to be
in Theater Arts more often than did the DO’s (significant at the
.01 level).


TABLE 2

Frequency of Men's Reports of “Conviction” to be a Theater Arts Major

               No “Conviction”    “Conviction”    Totals
Male BA's            28                72           100
Male DO's            52                48           100
Totals               80               120




TABLE 3


Frequencies of Responses of Women in the BA and DO Groups Who Reported
Yes or No to Three Autobiographical Items


                                                  Mode of Responses
Item                                               No     Yes    Totals

1. Having Specific Occupational Goals     BA       59      41      100
                                          DO       44      56      100
                                          Totals  103      97

2. Indecision to be a Theater Arts Major  BA       91       9      100
                                          DO       69      31      100
                                          Totals  160      40

3. “Conviction” to be a Theater Arts      BA       28      72      100
   Major                                  DO       52      48      100
                                          Totals   80     120


The female BA’s did mention a greater frequency of occasions 
pertaining to “working in shows” (acting, directing) than did the 
DO’s. The means for the BA’s and DO’s were, respectively, 5.49 
and 4.51. 
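
The chi-square comparisons reported for Tables 2 and 3 can be reproduced from the printed frequencies. The following is a minimal sketch using the uncorrected Pearson chi square for a 2 x 2 table; whether the Questionnaire Analysis Program applied a continuity correction is not stated, so the uncorrected form is an assumption. The critical values for one degree of freedom (3.841 at the .05 level, 6.635 at the .01 level) are quoted from standard tables.

```python
def chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Cell order: (BA no, BA yes, DO no, DO yes), as printed in Tables 2 and 3.
tables = {
    "Men, conviction (Table 2)": (28, 72, 52, 48),
    "Women, specific occupational goals": (59, 41, 44, 56),
    "Women, indecision": (91, 9, 69, 31),
    "Women, conviction": (28, 72, 52, 48),
}

for label, cells in tables.items():
    print(f"{label}: chi square = {chi_square_2x2(*cells):.2f}")
```

With the frequencies as printed, the values are about 12.0 for conviction, 4.5 for specific occupational goals, and 15.1 for indecision, which agrees with the reported .01, .05, and .01 levels respectively.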


Summary and Discussion 


Although the Quantitative and Linguistic scores on the ACE 
Psychological Examination did not discriminate significantly be- 
tween the BA's and the DO's, reports from the semi-structured 
autobiography did at relatively high levels of significance. Motiva- 
tion expressed by a “conviction” to be in the Major was found to be 
a strong factor in completing the work for a Bachelor's degree for 
both men and women. Moreover, the women DO's did express the 
obverse of this by reporting their indecision to be in the major at 
the outset as well as reporting “specific occupational goals”—usually 
to be an actress—more often than did the female BA’s. Both men 
and women BA’s cited more experiences of working in high school 
plays than did DO’s.

Accordingly, it appears that drive, as expressed in autobiographi- 
cal statements, may be of significance in predicting completion of 
the BA degree in Theater Arts; whereas intellectual ability as indi- 
cated by either the Quantitative or Linguistic scales on the ACE 
Psychological Examination is not.


Furthermore, participating behavior, shown by frequency of re- 
ports of performance in plays in high school, significantly differenti- 
ated between those who did achieve the BA degree in Theater Arts 
from those who did not, whereas standing on the two scales of the 
ACE Psychological Examination did not. 

Further research in identifying successful students aspiring to the 
BA degree in Theater Arts from the entire body of students who 
can meet the University entrance requirements may well be highly
promising if “non-intellective” measures are developed. What the
student wants and what he has done in the field appear to be strong 
factors in his completing the degree or in dropping out. These find- 
ings join the converging lines of research already reported by the 
investigators named at the beginning of this report. 


REFERENCES 


Anastasi, Anne, Meade, Martin J., and Schneiders, Alexander A. The 
Validation of a Biographical Inventory as a Predictor of College 
Success. New York: College Entrance Examination Board, 1960. 

Getzels, Jacob W. and Jackson, Philip W. Creativity and Intelli- 
gence. New York: John Wiley and Sons, Inc., 1962. 

Morrison, Jack. Four Hundred New Students in Theater Arts: An 
Experimental Study of Those Who Dropped Out and Those Who 
Achieved the BA Degree at UCLA. Unpublished Ph.D. thesis. 
Los Angeles: University of Southern California, 1962. 




COMPARISON OF THE VALIDITIES OF SELECTED TEST 
PROCEDURES TO PREDICT SHORTHAND SUCCESS 


WALTER PAUK¹
Cornell University 


The Turse Shorthand Aptitude Test (1940) purports to predict
shorthand success in high schools by measuring the following seven
abilities: writing symbols rapidly; listening and writing simultane-
ously; learning and combining abstract symbols; associating the
correct spelling of a word upon seeing its phonetic form; discriminat-
ing between words having similar or identical shorthand outlines;
spelling correctly; and constructing words from incomplete short-
hand outlines.


Problem 


Users of this test are faced with two practical problems, both in-
volving time. First, the total administration time of the test (dis-
tribution, collection, filling in of data, and instructions for each
subtest) is approximately 60 minutes. The usual class period is thus
exceeded by approximately 10 minutes. The second problem is that,
since the nature of the test precludes machine scoring, approxi-
mately 15 minutes are needed to score each test by hand.

It was the inordinate use of time, both to administer and to
score, which led this researcher to ascertain whether or not one or
more subtests could be eliminated on the grounds of duplication or
low validity without impairing the efficiency of the total test to
predict shorthand success. Obviously, the elimination of one or more
subtests would shorten the total test and would thus effect a saving
of time.

¹ The writer is grateful to Dr. Jason Millman, assistant professor of Educa-
tional Psychology and Measurement at Cornell University, for his help on the
statistical aspects of this study.

After a close examination, the researcher concluded that the four 
subtests (Spelling, Phonetic Association, Word Discrimination, and
Word Sense) were primarily tests of verbal ability. Thus, all four 
might be measuring, to some degree, the very same ability. 

If, in fact, verbal ability were being measured by these four sub- 
tests, the researcher then speculated that a standardized vocabulary 
or linguistic test would probably measure and predict as well as 
these "shorthand" subtests. 

The researcher further speculated that the ability measured by 
the remaining three subtests (Stroking, Symbol Transcription, and 
Dictation) had very little bearing on predicting success in the actual 
taking of shorthand for several reasons. First, in the Stroking sub- 
test it is difficult to see how the making of short downward strokes 
in a rapid and mechanical fashion would lead to the prediction of 
shorthand success, since, in the actual situation of writing shorthand, 
success depends on a person's ability to convert rapidly the spoken 
word into a symbol made up of a combination of lines—both curved 
and straight and varying in length and direction. To write symbols 
rapidly and correctly, strong associations, built up through a great 
deal of practice, must exist between the words and symbols. Second, 
the Symbol Transcription subtest unfortunately does not depend on 
a strong mental association between the spoken words and symbols. 
This subtest actually lists 14 separate letters with a hieroglyph 
beneath each letter. The student is then directed to solve some 
shorthand symbols. This can be done simply by visual inspection 
which requires a bare minimum of memorization or association. 
Third, in the Dictation subtest, the administrator reads aloud single 
sentences which the students record in regular handwriting (long 
hand), just as they would do in taking notes of a lecture. This 
activity, however, does not involve the ability which is needed in 
shorthand; that is, the ability to convert the spoken word into a
symbol and then to write the symbol (the end result is the short-
hand “character”).


Purpose 


There were three major objectives in undertaking the investiga- 
tion of the validity of tests in prediction of shorthand marks of 
ninth grade girls.


1. To ascertain whether or not the magnitude of the validity
coefficient of the four "verbal" subtests is significantly different from
that of the three subtests designed to measure the mechanics of
shorthand when shorthand grades are used as the criterion.

2. To determine whether or not the values of the validity coeffici-
ents of the “verbal” subtests, either individually or in combination,
are significantly different from that value of the validity coefficient
for the entire test when the scholastic grades achieved in a full year
of shorthand are used as the criterion variable. (One may anticipate
that if shorthand grades are not strongly related to subtests attempt-
ing to measure the mechanics of shorthand, the presence of these
subtests may actually reduce the validity of the total test.)

3. To obtain evidence as to whether or not the validity coefficient 
of the linguistic ability (L-score) on the ACE Psychological Ex- 
amination for High School Students (1953) is significantly different 
from that of the total Turse Shorthand Aptitude Test. It may be 
anticipated that if shorthand grades are strongly related to the 
verbal subtests, then it may be that a more general linguistic apti- 
tude test, such as the L-score of the ACE would be equally valid 
for predicting shorthand success.


Results
The predictive validity coefficients (zero-order correlations) for 
the various tests and subtests relative to the criterion of grade 
point average are described in Table 1. 


TABLE 1

Predictive Validity Coefficients of Test Scores With Shorthand Grades of
Ninth-Grade Girls (N = 41)


                                    Correlation
Tests and Subtests                  Coefficients
Verbal Subtests
  Spelling                              .56
  Phonetic Association                  .58
  Word Discrimination                   .62
  Word Sense                            .63
  Total (all four subtests)             .66
Mechanical Subtests
  Stroking                              .04
  Dictation                             .18
  Symbol Transcription                  .97
  Total (all three subtests)            .34
Total, Turse test                       .63




Conclusions 


1. The validity of the four verbal subtests to predict shorthand 
success is greater than the validity of the combined three mechanics- 
of-shorthand subtests (.66 vs .34). The validity of the combined 
verbal subtests differs significantly from the validity of the com- 
bined mechanical subtests at the .01 level of significance. 

2. The validity of the combined four verbal subtests to predict 
shorthand success is about the same magnitude (.56 to .66) as the 
validity of the total score on the Turse test (.63). Individually, the 
validity of one subtest, Word Sense (.63), is equal to the total 
Turse test (.63). 

3. The validity of the total score on the Turse test is not sig- 
nificantly different from that of the linguistic ability (L-score) of 
the ACE test (.63 vs .63). 


Discussion 

In keeping with the theme of saving time without sacrificing pre-
dictive efficiency, and keeping in mind that it takes one full period
plus the greater portion of another to administer the entire Turse
test, the following testing times are given: the total test-taking time
for the combined four verbal subtests of the Turse test is 23 minutes;
the time for the Word Sense subtest of the Turse test is but 7
minutes; and the time for the Linguistic portion of the ACE
is 15 minutes. (The test-taking time for the entire ACE is 35
minutes.)

Based on the above discussion and the results shown in Table 1,
it appears, then, that it is an unwise use of time to administer the
entire Turse test for the purpose of predicting shorthand success in
high school, especially when the verbal subtests of the Turse test,
individually or collectively, or a standardized linguistic aptitude test
may predict as well. Incidentally, the use of the ACE in its entirety 
has a bonus effect: it would provide a score not only useful to the 
commercial departments for predicting shorthand success, but also 
a highly reliable and interpretable score useful for purposes of 
counseling and guidance. 

Of course, cross-validation studies are needed to ascertain whether 


these results would still hold in new situations. Such studies are 
planned for the future. 
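
The need for cross-validation can be made concrete by asking how precisely a coefficient of .63 is estimated from 41 cases. The sketch below is an illustration under stated assumptions, not part of the original analysis: it uses Fisher's z transformation, whose standard error is 1 / sqrt(N - 3), and assumes approximate bivariate normality. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def fisher_confidence_interval(r, n, level=0.95):
    """Approximate confidence interval for a Pearson r via Fisher's z."""
    z = math.atanh(r)                      # r-to-z transformation
    standard_error = 1.0 / math.sqrt(n - 3)
    critical = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower = math.tanh(z - critical * standard_error)
    upper = math.tanh(z + critical * standard_error)
    return lower, upper


# Validity of the total Turse test as reported in Table 1 (r = .63, N = 41).
low, high = fisher_confidence_interval(0.63, 41)
print(f"95% interval for r = .63, N = 41: {low:.2f} to {high:.2f}")
```

Under these assumptions the interval runs from about .40 to about .79, wide enough that differences of a few hundredths among the verbal subtests and the total test should not be over-interpreted until new samples are available.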




REFERENCES 


Thurstone, L. L. and Thurstone, T. G. American Council on Educa- 
tion Psychological Examination for High School Students. 
Princeton: Cooperative Test Division, Educational Testing 
Service, 1953. 

Turse, P. L. Turse Shorthand Aptitude Test. New York: World
Book Co., 1940. 




BOOK REVIEWS 


Edited by 
WILLIAM B. MICHAEL 
University of California, Santa Barbara 


Gage's Handbook of Research on Teaching. HAROLD H. ABELSON
Cooley and Lohnes’ Multivariate Procedures for the Behavioral Sciences. FRANK B. BAKER
Hammond and Householder's Introduction to the Statistical Method. FRANK B. BAKER
Tyler’s Tests and Measurements. HERBERT ZIMILES
Weinberg and Schumaker's Statistics: An Intuitive Approach. GENE V. GLASS
Gagné's Psychological Principles in System Development. JOSEPH C. PHELAN
Hunt's Concept Learning: An Information Processing Problem. JOHN D. FORD, JR.
Bachrach’s Psychological Research, An Introduction. HAROLD BORKO
Berelson's The Behavioral Sciences Today. HAROLD BORKO
Sherif's Intergroup Relations and Leadership. JOHN DELAMATER
Lott and Lott's Negro and White Youth: A Psychological Study in a Border-State Community. EDYTHE MARGOLIN
Cronbach’s Educational Psychology. REGINALD L. JONES
Smith and Dechant's Psychology in Teaching Reading. MILDRED ROBECK




Handbook of Research on Teaching by N. L. Gage (Editor). Chi-
cago: Rand McNally & Co., 1963. Pp. v + 1218.

As stated by its editor, the primary purpose of the Handbook is 
to empower workers on research on teaching “. . . to begin at a 
higher level of competence and sophistication, to avoid past mistakes 
and blind alleys, to capitalize on the best that has been thought 
and done.” It attempts to summarize, analyze critically, and inte- 
grate the vast body of research on teaching of the last half century; 
yet it seeks at the same time to remedy an alleged failure of such 
research to remain in touch with the behavioral sciences. Although 
the individual chapter authors were given wide latitude in which to 
develop their assigned areas, they were presented with a plan pre- 
pared by the editor with the advice of a Board consisting principally 
of the Committee on Teacher Effectiveness of the American Educa- 
tional Research Association, as of 1957. In particular, a framework 
for research on teaching was suggested to the authors of the chapters 
on “substantive problems and findings.” This framework consisted 
of a “variables” approach, with central, relevant, and site variables 
as three major classes. Fortunately, the deadening possibilities of a 
narrow variables approach were avoided by (1) the granting of 
freedom, freely taken, to authors to develop their respective subjects 
responsively to the diverse character of the research reported, and 
(2) the inclusion of searching chapters on the historical, philosophi- 
cal, theoretical, and methodological background of research on 
teaching. 

Dr. Gage and his colleagues have assembled an exceptional group 
of talented and academically disciplined scholars to write the 23 
chapters of the Handbook. The chapters are grouped under four 
parts: I. Theoretical Orientations, II. Methodologies in Research on 
Teaching, III. Major Variables and Areas of Research on Teaching, 
and IV. Research on Teaching Various Grade Levels and Subject 
Matters.

The grouping of the topics should be viewed as one of ready
convenience only, for some of the most astute theoretical and
methodological insights are revealed in the last two, or substantive,
sections. This does not mean that the three presentations under
Part I, for example, are not highly meritorious. Thus, Broudy's fresh
treatment, “Historical Exemplars of Teaching Method,” tracing as
it does such widely diverse approaches to teaching as the ancient
teaching of oratory, the Socratic dialectic, scholasticism, Jesuit edu-
cation, and the pedagogical principles of the great educators Comen-
ius, Pestalozzi, Froebel, and Herbart, serves implicitly as a caustic
remonstrance against the widespread neglect of pre-twentieth cen- 
tury teaching theory and practice as needed sources of insight for 
current conceptualization. Brodbeck’s keen analysis of logic and 
scientific method, although couched in general terms that might 
escape the impatient reader, is also truly a “must” for the serious 
researcher. Gage’s summary of paradigms for research on teaching 
is a masterpiece of systematization that should help researchers to 
make order out of a seeming chaos of patterns. 

The technically excellent chapters of Part II, which treats the 
Methodologies in Research on Teaching, serve to systematize statis- 
tical, experimental, and quasi-experimental techniques and to offer 
numerous suggestions concerning data-gathering through classroom 
observation and through the use of rating and measurement devices. 

Tatsuoka and Tiedeman develop their succinct treatment of 
selected statistical techniques around a chart that classifies these 
techniques according to their “role” or function, scale type em- 
ployed, and the number of variables involved. In view of the over- 
bearing weight currently assigned to tests of statistical significance, 
the reading of their treatment of the underlying rationale of such 
tests may well be followed by the perusal of Lumsdaine’s cautionary 
discussion, later in the Handbook, of the null hypothesis and the 
interpretation of “negative results.” 

Using McCall’s 1923 book entitled How to Experiment in Educa-
tion as a point of departure, Campbell and Stanley carry the reader
systematically through a maze of pre-experimental, experimental, 
and quasi-experimental designs. The exposition is exceptionally clear 
for such material. It is well supported with appropriate illustrations. 
Systematic mastery and evaluation of techniques are further en- 
hanced through the use of summarizing tables in which the various 
designs are rated against a set of stated criteria. 

In the field of data-gathering procedures, Medley and Mitzel have
done a craftsmanlike job in analyzing the modes for measuring
classroom behavior. Their chapter should go far toward enhancing
the efforts of those who seek definitive and statistically sound data 
in this respect. Remmers extends his treatment of “Rating Methods 
in Research on Teaching” beyond the more usual kinds of rating 
scales to include sociometric methods, the semantic differential, and 
Q-technique rating. In his chapter on “Testing Cognitive Ability 
and Achievement” Bloom considers such topics as the following: what
the student brings to the educational enterprise; how learning 
experiences modify behavior; how examinations affect students; and
how the efficiency of achievement examinations may be improved. 
He notes in conclusion that examinations are not an end in them- 
selves and that examinations used in research on teaching may 
readily become a spurious part of the independent variable. The 
chapter on “Measuring Noncognitive Variables in Research on 
Teaching” sweeps widely over such diverse topics as the art of 
teaching, early attempts to measure volition, the contribution to 
measurement of psychopathology, multivariate assessment, the soci-
ology of attitudes and values, depth psychology, neo-Gestalt per-
ceptualism, case-study method, the causal-genetic method, and
situational methods, not to mention a variety of topics dealing with 
the relation of measurement to methodology in teaching. 

Without derogating these scholarly treatises it should be noted 
that they leave entirely untouched the question as to how research 
methodology may be employed to create or discover new modes of 
teaching. To be sure, the researcher who utilizes the described ex- 
perimental or data-gathering procedures stands ready to detect 
whatever he may come across that is new in teachers or teaching. 
However, this important objective of educational research is granted 
incidental status at best, while essential emphasis is placed on the 
question: Given an experimental factor or variable, how potent is 
it? Moreover, it may be too much to expect this section of the 
Handbook, which is keyed to the more technically limited conception 
of research, to clear up one of the enigmas of educational research; 
namely, why the excellent beginning Bloom and his colleagues made 
in developing a taxonomy of educational objectives has not been 
more energetically pursued in the cognitive domain or more ade- 
quately extended to the noncognitive domains than it has been. It 
would appear evident that until the philosophical question of desired 
educational outcomes is transformed into terms of psychologically 
recognizable and procedurally manageable indicators, many experi- 
mental and descriptive studies will remain rudderless. 

Part III of the Handbook, entitled “Major Variables and Areas of
Research on Teaching,” covers five topics: (1) teaching methods;
(2) the teacher's personality and characteristics; (3) instruments
and media of instruction; (4) social interaction in the classroom;
and (5) the social background of teaching. A mint of useful, sub-
stantive research is to be found under these five rubrics, but Part III
may be read with equal if not greater profit for its potential con-
tribution to a meaningful and fruitful conceptualization of the
teaching process as a subject for research. Thus, Wallen and Travers,
in the opening chapter in this Part, argue that teaching methods as
studied in the past have employed non-behavioral terms for the
most part. The authors draw attention to such origins of teaching
methods as teaching traditions, social learnings in the teacher's
background, philosophical traditions, teachers' needs, school and
community conditions, and research in learning, not to speak of the 
teacher’s conception of his role. They also describe mental health 
approaches to teaching along with the more traditional modes. Their 
own preference is for a notion of “patterns of teaching behavior” as 
a basis for their proposed model for analyzing teaching methods. 
Here is much grist for the mill of the researcher on teaching who is 
still asking the question: What is teaching? Gage’s chapter on 
“Paradigms for Research on Teaching” may well be reexamined 
in this context. 

After presenting a well organized and “meaty” summary of dozens 
of research studies in their chapter on “The Teacher’s Personality 
and Characteristics,” Getzels and Jackson conclude that “despite the 
critical importance of the problem and a half century of prodigious 
research effort, very little is known for certain about the nature and 
measurement of teacher personality, or about the relation between 
teacher personality and teaching effectiveness.” Difficulties of defini- 
tion, instrumentation, and criterion beset research in this field. The 
failure of research to achieve much more than the obvious, according
to the authors, may be laid mainly at the door of the empirical, 
ad hoc attitude of the vast majority of researchers as against an 
approach which formulates and applies a theoretical frame of refer- 
ence to the problem at hand. 

Lumsdaine’s treatment of instruments and media of instruction 
makes an interesting sequel to the chapter on teacher personality 
and characteristics, for the media of instruction have often been 
viewed as an alternate to the teacher as a person. Lumsdaine points 
the way out of the false dilemma of choosing between the teacher 
and the teaching medium by anchoring much of his discussion in 
processes and factors affecting the learner as the recipient of the 
teaching function. While Lumsdaine presents an informed review 
of research on the efficacy of the several media as well as a penetrat- 
ing account of methodological problems in this area, the most 
pointed portions of this superb chapter deal with such topics as 
active student response; overt and covert responding; feedback, 
reinforcement, and knowledge of results; cueing or prompting; 
prompting versus confirmation; interaction of prompting and overt 
responses; organizational and sequencing factors; size of step; self 
pacing of practice; multiple cues; perceptual blueprinting; pre- 
instruction procedures; familiarization; repetition and redundancy; 
sequencing to facilitate discrimination; the role of verbalization; 
and the manipulation of incentive and motivational factors. While 
the language is the language of programmed instruction, the thought 
applies to all forms of the teaching-learning process. Teaching, as a
man-machine, individual-environment system, needs to be studied
in relation to process factors such as these. We are significantly
indebted to Lumsdaine for having pointed up so clearly how key
elements in the psychology of learning can be studied in a teaching 
context, whether the teaching component be a person, an instrument, 
a procedure, or a combination of all three.

In the preface to the Handbook the editor complains that in recent 
decades research on teaching “. . . has lost touch with the behavioral
sciences” and “. . . has not drawn enough nourishment from theo-
retical and methodological developments in psychology, sociology
and anthropology.” The chapter on “Social Interaction in the 
Classroom” follows the orbit of teaching as it swings across the 
orbital paths of the mental hygiene movement and group dynamics, 
epitomized in the study of the “social-emotional climate” of the 
classroom. Research on social interaction in the classroom per se is
treated in terms of interpersonal perceptions of teachers and stu-
dents, sociometry, and teacher-centered versus learner-centered in-
struction. Withall and Lewis, the authors of this chapter, bear out
the editor’s dim view of the extent of past application of behavioral
science theory in a closing plea for the injection of theoretical sys-
tems into research to support the current interests in application of
concepts from group dynamics, social psychology, psychotherapy,
and information theory to problems concerned with interaction in
the classroom.


In his chapter “The Social Background of Teaching,” Charters
draws principally upon the research literature of educational soci-
ology to treat such topics as the teacher's position in the American
social structure and the consequences of that position on teacher
effectiveness, the value orientation of teachers, the influence of the
teaching occupation upon teachers, induction into the teaching occu-
pation, the community environment of teaching, and styles of ad-
ministrative behavior as affecting the teacher's role in the school.
The thoroughness with which these topics are treated is sufficient
recompense, in view of space limitations, for the author's intentional
exclusion of anthropological and sociological materials bearing on
educational processes outside the public and secondary school setting
in the United States. Nevertheless, the limiting emphasis on the
teacher and his occupational place and role leaves for some future
treatment the important task of bringing a wider circle of constructs
and theories of sociology and anthropology to bear on the process
and problems of teaching.

If the first three Parts of the Handbook are necessarily general in
their scope, it is not surprising to find that Part IV, dealing as it
does with “Research on Teaching Various Grade Levels and Subject
Matters,” is quite specific. The question may well be raised as to
whether skeletal considerations must be lost sight of as the flesh and
blood of teaching practice is envisaged. Theoretically, too, the same
basic questions need to be asked whether one is treating research in
the nursery school or at the college and university levels—the
subjects of the first and last chapters respectively of Part IV.
Reflecting current interests, McKeachie devotes the bulk of his
chapter on college and university teaching to research on teaching
methods as such; whereas the behavior of the teacher as she inter-
acts with children is given considerably greater emphasis in the
discussion of teaching in the nursery school by Pauline S. Sears and
Edith M. Dowley. Both chapters, which are invaluable in depicting
the status of research in their respective fields, should prove most
helpful to future researchers.

Overlooking the unfortunate use of the term “subject matters” to
refer to the several curriculum areas in which research is reviewed
in the seven remaining chapters of the Handbook, one cannot but
be impressed with the wealth of stimulating material contained in
these substantive treatises by highly knowledgeable authors. The
careful reading of the penetrating and balanced review by Russell
and Fea of research on the teaching of reading might serve to shame
many of the current controversialists in this difficult field into a
more sober consideration of their subject. Though lacking the psy-
chological approach to teaching method typical of most of the
writings in the Handbook, Metcalf’s chapter on “Research on
Teaching the Social Studies” is significant: affirmatively in remind-
ing the reader of the contributions speculative thought can make to
research hypothesizing, and negatively in drawing attention to the
relative paucity of hard-core psychological studies in the field of
social studies teaching. The chapter on “Research on Teaching
Composition and Literature” by Meckel, as in the case of the one
about reading, may help to temper extreme, dogmatic views. How-
ever, if the summary reflects the over-all scope of research in the
field, researchers on this subject would perhaps do well to broaden
their horizons to include not only the behavioral sciences, but also
humanistic studies.

The treatment by Henderson of “Research on Teaching Secondary
School Mathematics” gains in originality of presentation what it
loses in comprehensiveness of coverage. The student of research on
this subject may have to do more of his own work in surveying
research on mathematics teaching, but he will find Henderson's
chapter a source of insightful ideas in directing his examination of
such research. Watson's review of “Research on Teaching Science,”
while implying either relative scarcity of research or limited cover-
age in an area where one would expect a plethora of investigation,
is most interesting in its concluding remarks, which indict much of
the research in this field. Thus Watson writes, “The almost universal
emphasis upon gains in scores on achievement tests of limited scope
is alarming. . . . Most of the studies have seemed to treat the teacher
in terms of a narrow, stereotyped conception of role, not as an aspir-
ing human being in a situation rife with tensions.” The author sug-
gests that the personality characteristics of teachers may be found
to “. . . take precedence over whatever instructional techniques or
roles they have learned.” In his summary of “Research on Teaching
Foreign Languages,” Carroll returns to a systematic review of the
subject and reflects both old and new issues and procedures as they
have been subjected to research. The brevity of the chapter by
Hausman on “Research on Teaching the Visual Arts” may itself be
a commentary on a limited emphasis research workers may have
given to this curriculum area, or it may simply indicate the author’s 
position that, while descriptive and status studies are important, the 
focus of research should be on the “dynamics of the teaching situa- 
tion.” Understandably, the Handbook could not cover all the sub- 
jects of the curriculum. 

Viewed in any large sense, there can be no question but that the 
Handbook is exceptionally rich in potential for achieving an historic 
role in the lifting of research on teaching to new heights. But this 
potential will be lost if individual researchers and research as a
corporate entity do not capitalize on this promise through intelligent 
strategic planning and tactical operation. Serious attention needs to 
be directed toward purpose, conceptualization, and method as well 
as toward the organization of programs and projects. The collective 
body of research workers owes too much to those who conceived and
executed this monumental work not to follow through on its message.

The milestone achieved by the American Educational Research 
Association when it instituted the Review of Educational Research 
in 1931 would indeed be matched were the Handbook to become the
prototype of a series of similar handbooks covering major areas of
educational research. Although the Handbook is not the last word, it
clearly serves to bring the necessarily somewhat fragmented mate-
rial contained in the Review of Educational Research to a second
stage of integration. It might well provide a platform, through the
media of symposiums of penetrative thinkers in educational re-
search and related behavioral fields, from which might be launched
a further excursion in the integration of the theoretical and ap-
plicable knowledge necessary for a meaningfully controlled enhance-
ment of the educational enterprise.

HAROLD H. ABELSON
The City College, 
The City University of New York 


Multivariate Procedures for the Behavioral Sciences by William W. 
Cooley and Paul R. Lohnes. New York: John Wiley and Sons, 
1962. Pp. 211. 

The authors of this extremely readable book have introduced a 
much needed innovation into the field of computer programming, 
namely assuming the reader is familiar with the fundamentals of
FORTRAN programming. Such an assumption enables the authors 
to develop topics unfettered by tedious explanations of the pro-
gramming language. In addition, the book is not a statistical primer,
and the authors have assumed at least general familiarity with 
multivariate statistics and their mathematical bases. 

The organization of the book, which was well conceived, develops 
topics in a logical manner beginning with the mathematical basis
of the technique followed by (a) an application of the technique to 
an appropriate set of data; (b) a gross flow chart of the breakdown 
of the technique for programming purposes; and (c) a complete 
FORTRAN program for the multivariate technique. The basic 
format is applied to the following multivariate techniques: 


Multiple and Canonical Correlation
Multivariate Analysis of Variance and Covariance
Multivariate Discriminant Analysis
Multivariate Classification Procedures
Factor Analysis


Although what the authors describe as flow charts in the various 
chapters are mostly equations and FORTRAN statements with 
boxes drawn around them, they redeem themselves by using conven- 
tional flow charting techniques in the appendices. 

In the programs for the multivariate techniques the authors as- 
sume the existence of certain library subroutines which are contained 
in the appendix. It would have been extremely desirable to have 
presented the details of how one develops a program for at least one 
of these, say eigenvalues and eigenvectors, in as great detail as was 
done for the correlation matrix.

The programs presented, which were written for the IBM 709,
reflect some of the limitations of this older computer. For example,
the matrices employed were limited to 50 × 50, whereas programs
for the newer machines employ somewhat larger matrices. In addi-
tion, certain small details were programmed somewhat differently.
For example, the number of subjects was included as an input
parameter to the correlational routines. Such a practice leads to a
lack of operational flexibility; hence this reviewer prefers to program
the input so that the computer counts the number of subjects as the
data are read in to the machine.
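
The preference stated above can be illustrated with a minimal sketch, written in Python rather than the book's FORTRAN. It shows a two-variable correlation routine that counts the subjects as the records are read instead of taking the count as an input parameter. The function name `correlation_from_stream` and the whitespace-separated record layout are assumptions made for illustration only; nothing here is taken from the Cooley and Lohnes programs.

```python
import io
import math


def correlation_from_stream(lines):
    """Return (n, r) for two-variable records read one line at a time.

    The number of subjects n is counted as records arrive, so the caller
    never supplies it. Each non-blank line is assumed (for this sketch)
    to hold two whitespace-separated numbers.
    """
    n = 0
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank records
            continue
        x, y = float(fields[0]), float(fields[1])
        n += 1  # the routine does the counting
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    if n < 2:
        raise ValueError("at least two subjects are required")
    # Corrected sums of squares and cross-products.
    sxx = sum_xx - sum_x * sum_x / n
    syy = sum_yy - sum_y * sum_y / n
    sxy = sum_xy - sum_x * sum_y / n
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("a variable with zero variance has no correlation")
    return n, sxy / math.sqrt(sxx * syy)


# Hypothetical data deck: four subjects, y exactly twice x.
deck = io.StringIO("1 2\n2 4\n3 6\n4 8\n")
n, r = correlation_from_stream(deck)
print(n, r)
```

Because the count is derived from the data, the same routine can be rerun on a deck of any length without changing a parameter card, which is the flexibility at issue.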

The real significance of this book does not lie primarily in the 
computer programs presented, as most have been available in one 
form or another for some time, but rather in the basic approach 
taken by the authors. At the present time the market is glutted with 
books, primers, and paperbacks on the fundamentals of FORTRAN. 
Any average graduate student can learn these fundamentals in a
few sessions with such manuals. At this point, however, one is sud-
denly abandoned. Then, how to apply these fundamentals and how
to build a working FORTRAN program for a statistical technique 
represent a mystery. This book is an excellent text on intermediate 
and advanced FORTRAN programming, whether the authors in-
tended it to be such or not. To the best of this reviewer's knowledge
such superb instruction in the application of FORTRAN funda- 
mentals to real programming problems is unique to this book. 

The second significant aspect of this book is that it was written 
by two persons whose primary commitment is to education rather 
than to mathematics, numerical analysis, or electrical engineering. 
The existence of persons within education with the statistical and 
programming sophistication to write a pace-setting work such as this 
should be a source of pride to those in the field of education. Perhaps 
the authors have caught the computer programmers as off guard as 
Sidney Siegel found the statisticians a few years ago. 

FRANK B. BAKER 
University of Wisconsin 


Introduction to the Statistical Method by Kenneth R. Hammond 
and James E. Householder. New York: Alfred A. Knopf, 1962. 
Pp. 412. 

In the foreword to the book the authors stated that their aim was 
"to write a book which addresses itself to the logic of the statistical 
method for its own sake." This book, although interestingly written 
and well organized, achieves this goal for only an extremely limited
portion of the statistical domain, namely large samples. The statis-
tical logic of these fundamentals has been amply discussed in num-
erous other texts. Although sampling distributions and hypotheses
are finally developed in the last two chapters, the student is not
provided with sufficient means for employing the laboriously devel-
oped statistical logic in the real world. If the authors were seriously
interested in statistical logic, the considerable contributions of Stu-
dent, Fisher, and Wald would have provided a much better vehicle
than those of Galton and Pearson.

In an attempt to explain descriptive statistics in such a fashion 
that every student can learn the material from the book alone, the 
authors have impaired its usefulness for a number of other purposes. 
The references, which are inadequate, do not provide a sufficient
bridge to the larger world of statistical literature. For example, rank
order correlation is discussed in quite some detail; yet reference to 
Spearman’s work is not given. Similarly Type I and Type II errors 
are discussed on page 349; yet Neyman is never mentioned. When a 
student has finished this book, his only working tools are the 
statistical logic of large sample theory and a few descriptive statis- 
tical methods. Finally the lack of scope possessed by the book would 
give the student little cause to retain it as a reference for the day 
when real statistical problems are encountered. 


In our present age of desk calculators and digital computers there
is little justification for prolonging the agonies of techniques de- 
signed for hand calculation. Little or nothing is to be gained by 
computing means and standard deviations from grouped vs. un- 
grouped data; yet the authors consistently cling to these obsolete 
techniques. 
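
As a hedged illustration of the point about grouped data, the sketch below computes the mean and standard deviation of a small set of scores twice: once from the raw values and once from a frequency table of class midpoints, the hand-calculation shortcut. The scores, the class limits, and the function names are invented for the example and do not come from the book under review.

```python
import math


def mean_sd(values):
    """Mean and standard deviation (divisor n - 1) from raw scores."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var)


def grouped_mean_sd(values, lower, width):
    """Approximate mean and SD after grouping scores into classes.

    Every score is replaced by the midpoint of its class interval,
    which is what the grouped-data hand method amounts to.
    """
    counts = {}
    for v in values:
        k = int((v - lower) // width)  # index of the class interval
        counts[k] = counts.get(k, 0) + 1
    n = len(values)
    mids = {k: lower + (k + 0.5) * width for k in counts}
    mean = sum(counts[k] * mids[k] for k in counts) / n
    var = sum(counts[k] * (mids[k] - mean) ** 2 for k in counts) / (n - 1)
    return mean, math.sqrt(var)


# Hypothetical test scores.
scores = [52, 55, 61, 64, 68, 70, 73, 77, 81, 89]
print("ungrouped:", mean_sd(scores))
print("grouped:  ", grouped_mean_sd(scores, lower=50, width=10))
```

For these scores the raw mean is 69.0 and the grouped mean is 70.0: grouping saves arithmetic at the cost of a small approximation error, a saving that matters little once a machine does the arithmetic.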

The positive aspects of the book, easy reading, slow pace, inter- 
esting format, many problems with answers, and an excellent organ- 
ization, are outweighed by the limitation to pre-1920 material (with 
minor exceptions) and by inclusion of topics previously discussed. 
The book would be an excellent high school or junior college text for 
the average ability student, but would be unsuitable at the senior 
college or beginning graduate level because of its limited range of 
topics. 

FRANK B. BAKER
University of Wisconsin 


Tests and Measurements by Leona E. Tyler. Englewood Cliffs, 
N. J.: Prentice-Hall, Inc., 1963. Pp. 116.

This paperback of little more than one hundred pages is part of 
the “Foundations of Modern Psychology Series,” an alternative 
approach to the single text conventionally employed in an introduc- 
tory course. Written in a clear, interesting style, the book offers 
more content and better perspective than that provided by the omni- 
bus introductory text. In fact the presentation of some topics is 
superior to that encountered in many larger texts written for a
whole course in testing. 

Perhaps the most difficult tasks confronting the writer of an ele- 
mentary presentation are the choice of content and the adoption of 
a uniform and appropriate level of discourse throughout. Although
the author has succeeded admirably on both counts, there are
inevitably places where her judgment can be questioned. Following
an elementary introduction which is almost too simple, the remain-
der of the first chapter and all of the second are given over to concepts of
measurement and statistics which are beyond the scope of the
beginning student. The section which distinguishes among nominal,
ordinal, interval, and ratio scales of measurement establishes a
distinction which bears little relation to the pages that follow; it is 
more likely to distract and to confuse than to edify. Basic statistics 
are presented in the best tradition of such one-chapter presentations; 
this section is much too concentrated for all but the most gifted 
students. Important statistical concepts are frequently presented 
almost casually, with off-hand computational illustrations which 
are not sufficiently explained. An excellent but remarkably concise 
discussion of statistical inference is included. In fact almost an entire 
semester of work is summarized in thirteen pages! 


The elements of psychometric theory are presented lucidly, and 
in terms of the most current thinking. Validity is defined essentially 
in terms of construct validity, with examples of empirical and con- 
tent validity cited later within a context of specific testing prob- 
lems. In general psychometric concepts are discussed in relation to 
their utility rather than in terms of abstract principles or rigid 
methodological frameworks. While some teachers may prefer to 
work with a more analytic and disciplined approach to validity and 
reliability, the pragmatic approach of the author has the advantage 
of sparing the student some of the unresolved problems of the field 
while still acquainting him with the most essential concepts. 

The treatment of intelligence and aptitude testing is good, that 
of personality testing superior, in some respects, to what is usually 
found in more advanced textbooks. While adhering to the psycho- 
metrician’s values, the author does express a more tolerant and 
respectful view of the complexities of personality measurement in a 
discussion of both objective and projective tests than do most 
psychometricians. 

None of the serious problems which beset psychological testing is 
overlooked; yet the overall effect is constructive. For a brief presen- 
tation of psychological testing, the book affords an unusually com- 
plete and intelligent statement. 

HERBERT ZIMILES 
Bank Street College of Education 


Statistics: An Intuitive Approach by George H. Weinberg and John 
A. Schumaker. Belmont, Calif: Wadsworth Publishing Co., 
1962. Pp. xii + 338.

This text joins the ever increasing collection of applied statistics
textbooks for first-semester undergraduate and graduate courses.
As a member of this group, the book has fairly predictable contents:
measures of central tendency and variability, the normal
distribution, elementary hypothesis testing, estimation procedures,
chi-square tests, correlation, and prediction. A lengthy discussion of
the central limit theorem and a short chapter on nonparametric
tests are two distinctive features of Statistics: An Intuitive Approach.

The title suggests a new twist in the teaching of statistics. The 
presentation of the subject matter bears little resemblance to that 
of other texts in the same class. The pages are filled with exposition 
to the almost total exclusion of derivations and formulae. This text- 
book appears to be another in a series of reactions against “cook- 
book” or “follow-the-leader” texts once prominent in the social 
sciences. The authors propose in the first chapter to justify statis- 
tical methods in the light of common-sense reasoning. “Statistical 
methods” are defined as the “application of common-sense reasoning
to the analysis of data." At most points they succeed in their en- 
deavor admirably. Occasionally the arguments they present will 
probably fail to impress beginning students with their common- 
sense nature. It is particularly difficult to make the central limit 
theorem appear to be common-sense reasoning applied to the analy- 
sis of data. However, in general the presentation is thoughtful and 
careful, calculated to impress the reader with the logic of statistics 
rather than with the mechanics, and it is certain to be appreciated
by those not conversant with elementary mathematics. 

Some errors and inconsistencies in the text will annoy the knowl- 
edgeable reader. On page 6, a random sample is defined without the 
condition that each selection from the population be independent. 
On page 183, it is asserted that the positive square root of the 
unbiased estimator of σ² is an unbiased estimator of σ, which it is
not. While well handled in the main, the discussion of the central 
limit theorem is inconsistent at one point. The authors state on page 
117 that if samples are large enough, then the sampling distribution 
of the mean will be nearly normal. The words “large enough” are 
qualified in a footnote as follows: “Nearly always the samples are 
large enough if they contain at least 30 terms each.” (This statement 
appears to arise out of a confusion of the central limit theorem with 
the relationship between the normal and t distributions.) Then on 
page 115, one finds the statement that if the population is “extremely 
unbalanced,” giant-sized samples must be taken before their means 
are normally distributed. 
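
Two of the points raised in this paragraph can be checked numerically. The sketch below is illustrative only and rests on stated assumptions: for samples from a normal population, the expected value of s (the square root of the unbiased variance estimator) equals σ times a factor commonly written c4(n), which is below 1 for every finite n; and for independent draws from any population with a finite third moment, the skewness of the sample mean equals the population skewness divided by the square root of n. The function names are chosen for this example.

```python
import math


def c4(n):
    """E[s] / sigma for a normal sample of size n.

    s**2 is the unbiased variance estimator (divisor n - 1). A value
    below 1 means that s underestimates sigma on the average.
    """
    log_ratio = math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


def skewness_of_mean(population_skewness, n):
    """Skewness of the mean of n independent draws."""
    return population_skewness / math.sqrt(n)


# Bias of s as an estimator of sigma, normal population assumed.
for n in (2, 5, 30, 100):
    print("n =", n, " E[s]/sigma =", round(c4(n), 4))

# An exponential population (skewness 2) stands in here for an
# "extremely unbalanced" population.
for n in (30, 400):
    print("n =", n, " skewness of mean =", round(skewness_of_mean(2.0, n), 3))
```

The bias of s shrinks as n grows but never vanishes, and a strongly skewed population can leave the sampling distribution of the mean visibly skewed at n = 30, which is why a fixed threshold of 30 cannot serve as a general rule.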

The absence of important topics such as interval estimation of μ
when σ² is estimated from a sample and the power of significance
tests is a shortcoming. The authors’ choice of the terminology “ac- 
cept” rather than “fail to reject” the null hypothesis seems inadvis- 
able for students of elementary applied statistics in spite of the con-
ventions of mathematical statistics.

The exercises which follow each chapter tend to be very repetitive.
For example, at the end of chapter 14 the student is given the
opportunity to perform approximately 25 t-tests, all essentially the 
same. Answers to the odd-numbered problems are given. Chapter 6 
on computing statistics from grouped data and by coding is of 
doubtful utility. The tables in the appendix are adequate; the book 
is well indexed. 

Statistics: An Intuitive Approach does have a place in the training
of social scientists. In addition to serving as the text for those
courses for which its presentation is appropriate, the book could be 
used as a supplementary text for students recovering from "symbol 
shock" suffered in more traditional courses. Many students deserve
more than this textbook can offer, however. The mathematically
literate should not be deprived of the economy of time and clarity
which a more liberal use of elementary algebra would afford.




A quotation from William Blake appears alone on a page at the 
front of the book: “What is now proved was once only imagined.”
Curiously, virtually nothing about elementary statistics is proved in the text. To
many students, the bare facts about statistics will seem metaphysical 
when robbed of their mathematical beginnings. When the student is 
given only the fruits of the statistician’s labors and never watches 
him at work, those products must seem strange indeed. One can 
imagine a student asking of statistics presented without mathe- 
matics, 


“What immortal hand or eye 
Could frame thy fearful symmetry?” 
— William Blake 
GENE V. GLASS 
Laboratory of Experimental Design 
University of Wisconsin 


Psychological Principles in System Development by Robert Gagné 
(Editor). New York: Holt, Rinehart, and Winston, 1962. Pp. 500. 
$9.00. 

It is by now widely accepted that system design depends on the 
employment of principles and data derived from many disciplines. 
It is also tied to mathematical or other models. All kinds of "give 
and take" between areas are employed. This book shows that psy- 
chology can make a considerable contribution to the team effort. 

Gagné has collected a series of papers covering the various means 
by which psychology can contribute to system design. Not only are 
the papers well co-ordinated to form a unified, logical presentation 
of an approach to design and analysis, but also the approach and 
philosophy are well documented with experimental detail. Stressing, 
as it does, approach and methodology, the volume should be useful 
to students in systems research, irrespective of what their academic
specialty may be.

In Chapter 2 on human functions and systems Gagné broadly 
discusses systems research structure and plan. Areas such as range
and variety of input in man-machine interaction, load-sharing, and 
interaction possibilities fit into a general discussion of key problems 
in the wide areas of machine and human potential, limitations, and 
interactions. In Chapter 3 Ward Edwards covers a wide range of 
computer uses and copes with simulation problems and method.
Areas such as decision-making, pattern recognition, and information
retrieval were only sketchily covered for a review on this level.

Chapter 4 by Wulfeck and Zeitlin is a compact and detailed 
review of the areas of psychophysics as utilized in systems work.
This area is one to which psychology makes a real contribution.
Material of this type, treated from this point of view, needs expan- 
sion into a handbook for use by engineers, physiologists, and others 
who hope to devise systems of maximum efficiency and flexibility. 
This chapter alone makes the book valuable for reference. 

In Chapter 5, Kidd returns to the design problems which Gagné 
sketched broadly in Chapter 2. Ways in which psychologists can 
contribute to the team effort are detailed. Error elimination and 
optimal human functioning are goals. The range and variety of 
processes which need to be explored to achieve this end are described. 

Personnel subsystems, with emphasis on personnel selection and 
training, and the attendant task and function analyses, have been 
strong areas in which psychologists could contribute to systems 
research. In Chapter 6 Miller states the case for task analysis and 
presents a complete discussion of the techniques employed. In Chap- 
ter 7, Horst covers the techniques employed to maximize predictive 
efficiency insofar as personnel classification goes. Chapter 8, by 
Wulff and Berry, on job aids could be expanded to include handbook 
detail and to serve as vital reference material for systems training 
specialists. In Chapter 9 Crawford deals with training-in-systems 
as envisaged for individuals. Covering a wide variety of training 
principles and techniques, Crawford develops the trend toward pre- 
cision and refinement in training objectives as well as the specific 
information and practice exercises which successful system mainte- 
nance demands. Automated instruction is a new world and one that 
is rapidly changing. 

In Chapter 10 Biel describes new techniques in team training 
which are ingenious in the extreme. It begins to look as though con- 
cepts and techniques of training may well be the area in which 
psychology comes to make its greatest contribution. Boguslaw and
Porter, in Chapter 11, cover team training material, which, to the 
reviewer's knowledge, has been presented nowhere else. Their section 
on high fidelity system simulation is graphic, but could well be more 
comprehensively treated than it is in present form. 

The evaluation of individual system performance is dealt with in 
Chapter 12 by Glaser and Klaus. Here the old problems of difficulty 
in defining behavior to be measured, in reliabilities, and in criterion 
relevance are talked “around rather than through.” More and more 
emphasis will be placed on this area in the near future. Psychology
can make a really great contribution to system design in the evaluation
area. In Chapter 13, R. H. Davis and Behan continue the discussion
with emphasis on simulation design. Useful distinctions are
made between individual learning and systems learning. 

The book closes with a discussion by Finan about ways in which 
systems research must differ from theoretical research. In such areas 
as choice and definition of problems, uses of analogy and of hypoth- 
eses, definition of units, and uses of statistical inferences, he advances
arguments for real departures from the conventional approach to 
meet the demands of the practical situation. 

Gagné's collection of papers reveals the extent of contributions 
psychology has been able to make in training, in psychophysics, in
personnel selection, and in design analysis. Problems of criteria in 
evaluation of performance and of research methodology need further 
study. Design of successful systems depends on the model makers, 
but models must be related to practical situations through the media 
of data from a number of disciplines. 

This book shows how considerable is the contribution which psy- 
chology can make. It describes the contribution with fidelity and 
accuracy, and the results are a logical, coherent delineation of a
philosophy, an approach, and an impressive body of empirical
results.
JOSEPH G. PHELAN 
Los Angeles State College 


Concept Learning: An Information Processing Problem by Earl B. 
Hunt. New York: John Wiley & Sons, Inc., 1962. Pp. viii + 286. 
$7.50. 

In recent years, especially, most of us have learned that really 
significant research problems do not respect the traditional boun- 
daries of academic disciplines. The subject matter of Hunt’s Concept 
Learning reinforces this characteristic. In the closing chapter Hunt 
writes: “Concept Learning has been presented as a topic for logical 
analysis, a behavior to be explained by psychology, and as a desired 
capability of intelligent automata." His book is a review and inte- 
gration of work in these three fields from the standpoint of concept 
learning. Since until very recently few if any attempts have been
made to integrate work in these fields, many researchers will proba- 
bly find the review somewhat jarring when viewed from their own 
assumptions. In addition, let the reader be warned that comprehen- 
sion of some of the ideas presented in this book will be no easy after- 
dinner affair. This is especially true in the first half of the book 
where the theoretical issues relating concept learning to other human 
learning can probably be understood only by persons thoroughly 
familiar with these fields. That the book is difficult reading is in 
part a reflection of the difficult subject matter and in part a function
of the author’s complicated style of presentation. 

On the organization side, there is an extensive bibliography at 
the end of each chapter. There is an abbreviated but satisfactory 
subject index as well as a much needed author index. On the nega- 
tive side, footnotes in the form of comments are accumulated at the 
end of each chapter. The reviewer found himself reading each chap- 
ter with a marker at the end of the chapter to permit easy access to 
the footnotes. This cumbersome task could be eliminated if the
footnotes were incorporated into each chapter at the appropriate 
places. Since the comments are usually an integral part of the dis- 
course, it would be better if they were made a part of the text. 

The book, although organized into ten chapters, might better be
described as consisting of two parts. The first part is a very thorough 
review of human concept learning. Those familiar with the psy- 
chological research literature on concept formation will recognize 
the influence of the late Carl Hovland with whom the author was 
associated for several years. 

First, Hunt develops a definition of “concept” and relates this 
definition to formal logic. He restricts the term “concept” to defini- 
tions which can be expressed in symbolic logic. A concept is the 
name of a symbol which refers to an object or entity. Concept learn- 
ing is the learning of names. While in some psychological research 
other (often ill-defined) definitions of concept learning are used, 
Hunt’s definitions make possible the explication of concept learning 
tasks and permit the preparation of computer programs to model 
some of these operations. 

Chapter 3, “Concept Learning and Basic Learning,” is really a
tour de force. In it Hunt gives an extensive review and critique of 
the major contemporary theories of learning from the perspective 
of concept learning. As in several other recent books, there is dissatisfaction
with attempts to patch up or make ad hoc modifications
to existing S-R models to permit extension to the more complex
behavior of concept learning. As a better solution Hunt suggests a
redefinition of the response:


Is it unreasonable to assume that an adult human has learned 
several large “information processing units” which are specifically 
designed for the manipulation of an internally represented, sym- 
bolically coded environment? By using them the human could 
construct an internal model of the environment, one that could 
be used to predict external events. Such a conception of thought 
has been expressed by Campbell and, in particular, in the TOTE 
unit analogy discussed in a speculative monograph by Miller, 
Galanter, and Pribram. In an extended learning theory model 
these routines would be considered as responses. As such they 
would be held accountable to the laws of learning. Presumably 
they could be broken down into smaller more unitary S-R con- 
nections. This should not be done, because the smaller responses 
would not occur in isolation in the adult. The stimulus of having 
completed one step in an information processing procedure should 
be a sufficient stimulus to elicit the occurrence of the next step. 
Only at a very few “choice points” would the stimuli arising from 
the results of information processing be examined before deter- 
mining further responses. 


P. 




The remaining three chapters in the first part review research on
human concept learning within the perspective of Hunt’s conception
of data processing units or routines. These chapters are “Stimulus 
Organization,” “Memory and Concept Learning,” and “Strategies of 
Concept Learning.” 

The second part of the book discusses concept formation as realized
through mechanical means and especially computer programs.
A chapter entitled “Artificial Intelligence” introduces and surveys 
the current status in developing mechanisms or computer programs 
which perform “intelligent” tasks. It is in a chapter entitled “An 
Information Processing Model of Concept Learning” that the re- 
viewer feels the author communicates the most effectively about his 
subject. This is not surprising, since here he describes a computer 
program for concept learning which he has developed. In describing 
the program there is much information about the substance of his 
own theory. Also, he illustrates how the interplay between results 
from experiments on human concept learning and the results from 
computer models can illuminate the problems of both areas. 

Finally, he surveys a number of artificial models in relation to the 
problems of concept formation. These models include Pandemonium 
by Selfridge, a pattern recognizer by Uhr and Vossler, Perceptrons 
by Rosenblatt, and an algorithm for educated guesses by Kochen. 
A final chapter surveys the prospects for concept learning, especially 
through computer models, for the next few years. 

Certainly there is need to integrate some of the work on human 
learning with developments in formal logic and computer processing. 
Whether this book fills that need is a matter of opinion. The inte- 
gration can be and to some extent has been achieved through other 
research problems, such as the work on problem solving by Newell 
and Simon and others. Some psychologists will find Hunt’s defini- 
tions too restrictive. In this reviewer’s opinion Hunt is to be com- 
mended for making explicit his definitions of concept learning and 
for extending their application from experiments on human concept 
learning to the development of computer models. Those who wish 
to expand his definitions should gain a clearer understanding of some 
of the problems which they face. The reviewer believes that those 
who approach this book as a working statement of a researcher who 
has attempted to extend our understanding of complex human be- 
havior will be richly rewarded for their efforts. 

JOHN D. FORD, JR.
System Development Corporation 
Santa Monica, California 


Psychological Research, An Introduction by Arthur J. Bachrach. 
New York: Random House, 1962. Pp. 113. $1.45. 
This short, paperback book was prepared for the Random House
series entitled Studies in Psychology. The author, Arthur J. Bach- 
rach, is a member of the Psychology Department at Arizona State 
University. The avowed purpose of the book is to “share ideas about 
research, dispel a few misconceptions about the tedium of research, 
and interest some students into looking into the matter further.” In 
conformity with this aim of treating a rather formidable subject 
informally, the writing is lucid and colloquial. Although the content 
is largely anecdotal, some very important topics related to research 
are touched upon. These include Characteristics of Research (Chapter II);
The Fundamental Methods of Research: The Formal Theoretical
and the Informal Theoretical (Chapter III); The Problem of
Definition (Chapter IV); as well as some general considerations (in
Chapter V) dealing with animal and human research, ethical considerations
in research, and professional standards.

It is perhaps unfair to characterize the book as being shallow, for 
it was not meant to be an exhaustive treatise. Nevertheless the addi- 
tion of some more substantive material would have been welcome. 
There is no index, and the bibliographic references do not include 
such standard textbooks on experimental psychology as those by 
Boring, Koch, Osgood, and Stevens. 

The book can be read quickly and with enjoyment. It can serve a
useful purpose in introducing research as a pleasant and stimulating 
occupation, and thus can help the student to develop a positive atti- 
tude toward the task. It will then be up to the instructor to explain 
the methodology of research. 

HAROLD BORKO
System Development Corporation 
Santa Monica, California 


The Behavioral Sciences Today by Bernard Berelson. New York: 
Basic Books, 1963. Pp. viii + 278. $4.95.

The twenty papers which make up this volume were originally 
prepared for radio presentation in the forum series of the Voice of 
America. This fact, for better or worse, affects both the style and 
content of the papers. In a twenty or twenty-five minute oral presen- 
tation to a mixed radio audience one cannot get involved in a very 
technical discussion; nor can one be pedantic and list references 
and footnotes. On the other hand, because the papers are to be 
broadcast to a large audience through the facilities of the Voice of 
America, excellent speakers can be obtained and considerable time 
and effort can be spent in organizing and preparing the material. 
The result is a highly uniform, if not very technical, collection of 
papers by some well-known scholars including Cora DuBois, Ernest 
R. Hilgard, Carl I. Hovland, Harry F. Harlow, George A. Miller, 
Paul F. Lazarsfeld, and others. 

The book is well organized. Berelson in Chapter 1, Introduction
to the Behavioral Sciences, emphasizes the interdisciplinary ap- 
proach and compares the new concept of behavioral science to the 
older notion of social science. He defines the objective as being 
". . . to establish generalizations about human behavior that are
supported by empirical evidence. . . ." This introduction is followed 
by four chapters dealing with the university setting and with the 
present interests of anthropology, psychology, and sociology. Next 
there are chapters on research methodology including an excellent 
discussion by Hovland on the role of computer simulation in the 
behavioral sciences. These are followed by a group of chapters deal- 
ing with specific topics of interest to behavioral scientists such as 
animal study, language and linguistics, political behavior, and social
demography. The concluding three chapters are concerned with the 
applications and contributions of behavioral science to the recog- 
nized professions and to contemporary life in general, plus an inte- 
grative review. 

The papers which are introductory are oriented toward the student 
and to the general public who wish to know what is being studied as 
behavioral science in American universities. The professional, and 
even the advanced student, will find the discussion somewhat too 
general and the lack of references annoying. Nevertheless, the book 
does accomplish what it has set out to do—namely, to “provide a 
reasonable sample of the present status of research work” in the 
behavioral sciences today. 

HAROLD BORKO
System Development Corporation 
Santa Monica, California 


Intergroup Relations and Leadership by Muzafer Sherif (Editor). 
New York: John Wiley & Sons, Inc., 1962. Pp. xiv + 284. $7.95.
This book contains articles by fifteen contributors which deal with 
theory and research in the field of intergroup relations, and with the 
practical implications of this material. These papers, which were 
originally presented at the fourth interdisciplinary lecture series in
social psychology at the University of Oklahoma (held in April
1961), were revised for inclusion in this book.

The material in this volume includes articles by well-known psychologists,
social psychologists, sociologists, anthropologists, and
political scientists. Such an interdisciplinary approach to the problem
area in question provides the reader with a broader orientation
than is usual in a volume dealing with a specific type of behavior.
One of the major impressions one receives from reading these articles
is that, in analyzing intragroup and intergroup phenomena, workers
with such diverse orientations are moving toward a common ground.
The underlying theme in all of these papers is the explanation and
conceptualization of such phenomena as “part processes of the intergroup
system in question," which is, as Sherif suggests, the distinctive
feature of this book.

One very refreshing and promising aspect of these papers is that, 
in addition to a common frame of reference, they employ the same
or similar concepts in dealing with several different types of intergroup
phenomena (i.e., labor-management relations, international
relations, and ethnic and community-cultural problems). The foundations
of a comprehensive, unified theoretical approach for dealing
with this broad content area are explicitly and implicitly evident in 
this material—something which is seriously needed in most areas of 
social scientific inquiry. Thus, the introduction of supra-ordinate 
goals into an intergroup conflict situation is shown to be very effec- 
tive in reducing such conflict in several different specific situations. 
Also, the ideas and concepts employed in analyzing phenomena 
within each of the specific areas dealt with (e.g., labor-management 
relations) are doubly thought-provoking because one can see their 
applicability to situations encountered in other types of intergroup 
relations which are not explicitly discussed by these papers. 

Each of the articles in the book is well written and thought-provoking,
introduces fruitful new ideas, and is of value in
its own right. Several papers are especially noteworthy. One is
Sherif's introductory discussion of empirically-based generalizations 
concerning intra-system and inter-system phenomena, which sum- 
marizes relevant principal research findings. Blake and Mouton’s 
analysis of the dynamics of win-lose conflict situations, which is 
insightful and provoking, contributes a great deal to the understanding
of causal relationships and motivational phenomena in situations
where two groups each seek to win a conflict by defeating
the other. Also of considerable interest is Faris’ discussion of social 
attitudes and the mechanisms by which intergroup hostility is pro- 
duced. Finally, the “institutional” analysis of Negro leadership in 
the continuing desegregation crisis, by Killian, aids considerably in 
understanding some of the problems in this area. 

In general, the reciprocal relationships between what occurs within
a group and the nature of its relations with other groups
are stressed throughout; an integral part of the intra-system structure
is leadership, which is treated by several authors from this
frame of reference. Likewise, the necessity of approaching these
phenomena from several levels of analysis or interaction (e.g.,
individual, intragroup, intergroup) simultaneously in order to attain
a comprehensive theory regarding their nature and consequences is
evident throughout the book. One such attempt to interrelate these
levels is presented by Stogdill in Chapter 3.

Although this book has no direct relationship to problems of 
measurement, it is highly recommended to everyone who is interested
in intra-system and/or inter-system (group) phenomena. It is of
importance not only for its presentation and discussion of factual 
and research material, but also for the interesting and valuable 
conceptual-analytical materials which are presented. Moreover, be- 
cause intergroup relations are at the nucleus of many of our most 
serious national and international problems, this volume should be 
of great value to educators and scientists alike. 

JOHN DELAMATER
University of Michigan


The Culturally Deprived Child by Frank Riessman. New York: 
Harper and Brothers, 1962. Pp. xi + 140. $3.95. 

Social scientists should find this short volume a substantial con- 
tribution toward the study of educational and testing problems re- 
lated to lower socio-economic and “culturally deprived" groups. 
More specifically, The Culturally Deprived Child is concerned with 
the recently recognized problem of the educational neglect of these 
groups, which arises from explicit as well as implicit discrimination
by current school, testing, and teaching practices, and curricula. 

Why the lower-class Johnny cannot read, cannot compete with his
middle-class counterparts, and cannot become interested in school are
problems dealt with by the author from a cultural conflict approach. 
Particular cultural characteristics of these large urban groups are 
brought to light and examined in terms of the resulting attitudes and 
behavior toward education, toward the school, and toward the learn- 
ing process itself. Although much of the author’s cultural informa- 
tion is apparently derived from his own resources, it is pointed out 
that the strict traditionalism, anti-intellectualism, pragmatism, and 
external projection characteristic of these groups lead to educational
goals and motives quite apart from those shared by middle-class
educators. Discrimination in the forms of highly verbal tests, cur- 
ricula predominantly concerned with symbolic and abstract learning, 
and the intolerant or patronizing attitudes of teachers, occur because 
of the wide differences in values and goals between the deprived 
child, the teacher, and the system. In addition to the more obvious 
effects of this conflict, it is pointed out that children possessing what 
the author terms “one-track creativity"—that is, children gifted in 
only one direction or area—are seldom discovered and mainly ig- 
nored. 

Of particular interest is the attention given to the problems in- 
volved in the testing of deprived children for educational purposes. 
Current aptitude, achievement, and IQ tests are oriented toward 
highly verbal, “convergently creative" children with somewhat
sophisticated test-taking skills. While the author’s criticisms of the 
various tests available for use by the educator (including the so- 
called culture-free tests) are, for the most part, quite justified, the 
suggestion that educational testing be suspended in favor of teacher
evaluation of ability and talent appears to be impractical, particu- 
larly in view of the author's own comments regarding teacher bias. 
The concluding chapter incorporates suggestions for action and 
remedial techniques, which are thought-provoking and well-thought-out
ideas on problems in which much careful research is desperately
needed. 
ManquisA DELAMATER 
University of Michigan 


Negro and White Youth: A Psychological Study in a Border-State 
Community by Albert J. Lott and Bernice E. Lott. New York: 
Holt, Rinehart and Winston, Inc., 1963. Pp. 236. 

The major aim of this study was to investigate the values and 
goals of Negro and white youth as well as their educational and 
vocational preferences. Two objectives were planned: a) to test the 
senior classes of four high schools and b) to interview some of the 
leaders from each school. Two of the schools were in the county of 
Fayette, part of the Commonwealth of Kentucky, and two were city 
schools in Lexington, Kentucky. 

The origin of the study came from the Kentucky Council on 
Human Relations which decided on a “merit employment” program 
intended to influence the community to hire people on the basis of 
their ability rather than on color. The Council wanted to know of 
the availability of senior high school Negroes who might merit the 
kind of employment the Council had in mind. When answers to a
questionnaire had been received from three-fourths of the Negro 
seniors (44 girls and 47 boys) in a community, the information dis- 
closed that many students intended to leave home in search of better 
opportunities. One of the major reasons they gave was job discrimi- 
nation. Reasons related to financial aspects of a college education 
were also cited. Consequently a more intensive survey was planned 
for the purpose of finding comparable data on Negro and white 
youth to obtain some idea of motivations underlying their plans for 
the future. 

A total of 116 Negro students was studied: in School A (county) 
there were 23; in School C (city) 93. Of the total of 185 white stu- 
dents studied, 79 were in School B (county), and 106 were in School 
D (city). Their ages ranged from 16.8 to 17.9 years; I.Q. scores 
ranged from 71 to 134 with a mean of 106.5 and a median of 109. 

The following instruments were used: 1) Goal Preference Inven- 
tory, 2) Modified Form of the Study of Values, 3) Background and 
Outlook Questionnaire, 4) Leadership Poll, and 5) Test of Insight. 

The Goal Preference Inventory measured three needs: 1) academic 
recognition, 2) social recognition, and 3) love and affection. Some 
modifications were made on this test since it was originally designed 
to be used with college students. A modified form of the Study of
Values originally prepared by Allport, Vernon, and Lindzey was
used to measure predominance of certain motives in personality. The
classification was based on the following six characteristics of
Spranger's personality theory: 1) Theoretical, 2) Economic, 3) Esthetic,
4) Social, 5) Political, and 6) Religious. The Background and
Outlook Questionnaire consisted of 25 questions that were intended 
not only to obtain information on the student's parents, education, 
and occupation of parents, but also to ascertain the student's future 
plans. The Leadership Poll required the students to list 10 outstand- 
ing seniors in terms of popularity and their contributions to the 
school. This test was done anonymously. A modified form of a test 
of insight that had been prepared by Elizabeth French was employed 
to obtain measures of achievement and affiliation motives by having 
the students give an explanation for the behavior of hypothetical 
individuals. 

The findings of the study seem to indicate that Negro youth are 
not higher on esthetic or artistic interest than are the white youth, a 
stereotype which has been fostered in literature. The Negro youth 
in this study also did score higher on the need for academic recognition
than did the white youth, who scored higher on the need for
social recognition and for love and affection in the Goal Preference
Inventory. The authors offer a reason for this seeming contradiction. 
While past writings on the subject reveal that some Negro youth feel 
that education is the great force behind mobility, an important 
essential concerning Negro efforts in this direction would be whether 
or not they have had the opportunity to observe that this rise to the 
“good life" can actually happen to the Negroes who have tried to 
educate themselves.

The students were asked to place themselves in a socio-economic 
category such as upper, middle, or working class. They were given 
these labels with no accompanying description of what these implied. 
More than half of the Negro seniors placed themselves in the working-class
category; only 17 per cent of the whites did this. Thirty-six
per cent of the Negroes classified themselves as middle class
along with 57 per cent of the white youth.

Among the leaders studied, 30 were white students (16 girls and 
14 boys) and 28 were Negroes (11 girls and 17 boys). They revealed 
differences from the total sample of seniors in background character- 
istics. The white leaders had a non-Southern background; that is, they
had not lived in the South so long as the other students. The white
leaders differed from the other white seniors in that they came from
higher status homes, as suggested by the occupation of the bread- 
winner; they also spent less time watching television. The Negro 
leaders differed from other Negroes in the sample in that they had 
fathers who had attained higher levels of education and had earned
better incomes.


Both the Negro and white leaders were higher on measures indicating
religious or spiritual experience; they were more interested
in understanding people and less concerned with economic aspects
of life. In general, data from the Goal Preference Inventory suggested
that the Negro and white leaders were the same in that they
needed to be accepted by others and to feel part of a group more
than they preferred status achieved through academic competence.

The interview schedule was planned to elicit data on the students’ 
dominant values and their attitudes about goals, especially those 
relating to educational and vocational plans. When asked which they 
considered to be more important, achievement in academic or non- 
scholastic areas, more Negroes than white leaders felt it was more 
important to achieve academically. An interesting response from a 
white leader which typified another orientation from about half the 
white student leaders was: 


Your personality is the main thing that determines how suc- 
cessful you are. No matter what kind of studying or what kind 
of activities you get in, if you have a good personality I think 
you're going to be pretty successful, and just whatever you 
choose to do . . . study or go out for track, I don’t think it will 
make too much difference. (p. 109) 


Twenty-two dominant life-goals were inferred by judges in terms 
of the subjects’ interview responses. The goal “happiness,” defined as 
“seeking a feeling of contentment, joy, or satisfaction with the general
tenor of their lives, rather than interest in achievement” (p.
123), was more often expressed as important to white leaders. This 
goal along with religion seemed to make up a basic motivational 
value orientation for these leaders. The Negro leaders, on the other
hand, seemed to be interested mainly in respectability and financial 
security. Respectability was defined for the judges as “individuals 
who expressed desires to be law-abiding, moral, virtuous, or proper” 
(p. 124). Financial security referred to “the desire for enough money 
or property or both to insure comfort and a minimum of financial 
worry” (p. 124).

The present study does not support the conclusions of many other 
previous studies which have suggested that American Negroes place 
great value on education since they see it as a major source for 
economic and social advancement. The authors conclude that al- 
though the Negro youth may be constantly reminded by his parents 
and teachers that education is important in order to “get anyplace” 
in life, he does not act as if he believes it. 

The writers make this statement on the basis of two points: “1) 
We have found no consistent evidence to support the view that the 
Negro youth value education any more than do white youth and 2) 
We have found some evidence that the attitudes of Negro youth
toward education are more ambivalent than those of white youth”
(p. 131). They go on to say that “among the conditions which serve 
to push Negro youth off the course on which their parents and 
teachers are trying to guide them is that there still exists, for the 
Negro student, a relative paucity of models who have ‘made good’ 
as a result of studying hard, getting good grades, and being ‘smart’ 
in school” (p. 133). Temptations of immediate rewards such as a 
pleasant social life or participation in sports seem to be stronger 
than motivations required for the discipline in academic achievement. 
The authors suggest that for the Negro more interferences occur in 
relation to their educational goals because they are not so positive 
that those goals are what they really want. Negro students cannot 
be so sure as the white students that their efforts for scholastic com- 
petence will be rewarded when they get out of school. Consequently, 
movements toward the goal, which go forward and back, are easily 
sidestepped because of an uncertain promise of success. 

The pursuit of the Negro student leaders for goals related to con- 
crete, external symbols of achievement (as being of greater im- 
portance to Negroes than to the white leaders) may be a reflection of 
background differences. The white students seem to have parents 
who have attained to a degree the respectability and financial se- 
curity that some of the Negro students appear to want; thus the 
white student may be freer to pursue goals of an abstract and more 
personalized nature. Negro leaders seemed to be more concerned 
with community approval and with being of service to others than 
were the white leaders. These, in turn, seemed less preoccupied with 
public indications of leader propriety of behavior and more concerned
with achieving “psychological independence.”

In summary, there seems to be consistently greater economic satisfaction
and stability in the home setting among the white than
among the Negro students. Many young Negroes indicate intentions 
to leave home in search of better job placement, educational oppor- 
tunity, and economic advantages. 

The authors suggest that while Negro youth have been thought to 
be unrealistic in the past in terms of their position in society, they
do not seem so today. The Negro culture seems to have changed 
since World War II. Those urban students with formal high school
background seem more knowledgeable than previously about what 
society has in store for them. They are eager to apply skills learned 
at school. 

It is also suggested that more Negro “models” of those who had 
attained success as a result of educational competence might be
made available to Negro youth to the end that they might see that 
efforts made toward effective study and discipline are worthwhile. 
The authors realize that this would involve large-scale social
changes; consequently their suggestions on this aspect are weak
even though the points made on the beneficence of the idea are
strong.

A statement is made to the effect that even after students of lower 
status achieve higher positions in society their earlier self percep- 
tions remain. Consequently they never forget their low caste back- 
ground. Where one statement offers a possible solution for the
worthwhileness of the Negro's making the attempt to improve his 
station in life by means of education, the latter statement suggests 
the futility of such efforts. 

Somehow the final chapter dealing with interpretations, conclusions,
and “final thoughts” fell short of the authors’ earlier diligent
treatment of the problem. In spite of this weakness, however,
the book is a worthy contribution to the psychological and sociological
studies on American youth.

EDYTHE MARGOLIN
University of California,
Santa Barbara 


Educational Psychology (Second Edition) by Lee J. Cronbach. New 
York: Harcourt, Brace & World, Inc., 1963. Pp. xxvii + 706. 

Cronbach’s second edition of Educational Psychology is a sound 
presentation of psychological facts and principles bearing on educa- 
tional practice likely to be of value to students in introductory edu- 
cational psychology. The book seems to have been written with the 
view that the introductory course will precede classroom teaching 
by some period of time, and this orientation necessitates, therefore,
unusual emphasis on psychological principles. In most cases these 
principles are well buttressed by such relevant supporting research 
as is now available. 

Although some organizational changes have been made, and, as 
one might expect, considerable updating has been done, on the 
whole the book retains the previous format and highly readable 
style. The text’s 18 chapters comprise five broad sections: (1) Psychology
and School Problems; (2) Readiness and Its Development;
(3) Acquiring Skills, Ideas, and Attitudes; (4) Planning, Motivation,
and Evaluation; and (5) Emotional Learning. Within these
sections there is coverage of the usual areas included in such texts: 
development, learning, measurement, mental hygiene, and factors 
presumed to relate to these broad categories in school situations. 
Section 1, with its focus on the objectives of education and on 
possible sources of student bias in approaching the subject matter of 
educational psychology, is especially noteworthy.

A unique contribution is Cronbach’s attempt to capture the flavor 
of the factors influencing school learning by presenting an integrated
view of development and learning. Such an approach emphasizes 
the wholeness of the individual and the interrelatedness of learning,
development, and evaluation. However, there is the possibility
of omitting large segments of relevant materials too cumbersome for 
the author to synthesize. Also, such an approach increases the 
likelihood that the author might inadvertently do much of the 
synthesis and integration for the student rather than have the stu- 
dent do this himself. It is perhaps with these limitations in mind 
that Cronbach intersperses within each chapter questions which ac-
tively involve the student with the subject matter of that section. 
This attempt at creating student involvement within the individual 
section seems successful. 

Although this reviewer would quarrel with Cronbach's view of 
the vastness of research and theory available to teachers and 
prospective teachers, especially where complex behavior in school 
situations is concerned, there can be no question that he has, like 
most good textbook writers, judiciously extracted from the extant 
research literature those basic principles likely to be useful to
teachers. 

REGINALD L. JONES
Miami University (Ohio) 


Psychology in Teaching Reading by Henry P. Smith and Emerald 
V. Dechant. Englewood Cliffs, N. J.: Prentice-Hall, Inc., 1961. 
Pp. x + 470. $7.50

The major goals for this book, according to the authors, are to
select psychological data that are relevant to the teacher's under-
standing of the reading process, to interpret these data, and to apply 
the interpretations to specific classroom problems. Their selection 
and presentation of the extensive research in each of several perti- 
nent fields makes this book an important contribution. 

The topics which are explored most effectively are perception, 
learning principles, reading readiness, sensory processes, and cerebral 
factors. Terms are defined with clarity and precision: “Reading is 
the perception of graphic symbols. It is the process of relating
graphic symbols to the reader's fund of experience” (p. 44).

Bibliographies are extensive, drawing from such varied sources as 
Hinshelwood's Congenital Word-Blindness (p. 163) and Hymes' Pro-
duction in Advertising and the Graphic Arts (p. 266) for recom- 
mendations on type sizes. Primary sources are used extensively, but 
not exclusively. 

During the preparation of this book, Smith and Dechant purport 
to have examined 2500 studies, which on the whole were reviewed 
with care and detachment. The reader would have been helped by 
the insertion of the publication date in the text of research summaries,
since these were presented according to topic, rather than by
chronology. The authors cite two Harris studies on lateral dominance,
separated in the text by a Hildreth summary which indicates
“no evidence for a causal relationship between handedness and success
in reading” (pp. 168-69). The reader must refer to the bibliography
to learn that Hildreth’s summary was published prior to
both of the Harris reports.

More incisive evaluation of the studies cited could reasonably be
expected from authors who purport to interpret the data. The task
of reviewing so many studies is monumental; however, the “students,
parents, teachers, and administrators” for whom the book was
written may not have the primary sources at hand, nor the background
to evaluate the research design underlying each study. The
interpreter either must present studies which are qualified, or must
assume the more difficult task of evaluation. On the Wickman study
(p. 296), for example, the omission of essential aspects of the
researcher’s purpose and procedure leaves the reader with possible
misconceptions regarding the findings.

An attempt was made to help the reader understand the statistical
limitations of group tests. One such explanation was the use of the
standard error of measurement to determine the probable accuracy
of a child’s score (p. 416). This discussion is too brief to be useful to
one who lacks background in measurement or statistics.

The third goal, concerning applications and interpretations relative
to specific classroom problems, is far from attained. The chapters
which discuss reading programs are drawn from sources twice or
more removed from the classroom setting. Whenever a writer undertakes
to treat topics such as “Comprehension and Rate Skills” or
“Diagnosis and Remediation” in less than 30 pages each, he guarantees
the reader something less than thorough treatment.

The major contribution of Psychology in Teaching Reading is its
provision to reading teachers of information concerned with psychological
principles which help them understand the reader and the
reading process. Although it seems to the reviewer that the authors
attempted to undertake more than they could treat adequately in
one volume, nevertheless the positive features of the text far outweigh
the negative ones.


MILDRED ROBECK
University of California,
Santa Barbara


